Abstract

This paper establishes minimax rates for online regression with arbitrary classes of functions and general losses.^1 We show that below a certain threshold for the complexity of the function class, the minimax rates depend on both the curvature of the loss function and the sequential complexities of the class. Above this threshold, the curvature of the loss does not affect the rates. Furthermore, for the case of square loss, our results point to an interesting phenomenon: whenever sequential and i.i.d. empirical entropies match, the rates for statistical and online learning are the same.

In addition to the study of minimax regret, we derive a generic forecaster that enjoys the established optimal 
rates. We also provide a recipe for designing online prediction algorithms that can be computationally efficient 
for certain problems. We illustrate the techniques by deriving existing and new forecasters for the case of finite
experts and for online linear regression. 


1 Introduction 

We study the problem of predicting a real-valued sequence $y_1, \ldots, y_n$ in an online manner. At time $t = 1, \ldots, n$, the forecaster receives side information in the form of an element $x_t$ of an abstract set $\mathcal{X}$. The forecaster then makes a prediction $\hat{y}_t$ on the basis of the current observation $x_t$ and the data $\{(x_i, y_i)\}_{i=1}^{t-1}$ encountered thus far, and then observes the response $y_t$.

Such a problem of sequence prediction is studied in the literature under two distinct settings: probabilistic 
and deterministic [18]. In the former setting, which falls within the purview of time series analysis, one posits a 
parametric form for the data-generating mechanism and estimates the model parameters based on past instances 
and input information in order to make the next prediction. In contrast, in the deterministic setting one assumes 
no such probabilistic mechanism. Instead, the goal is phrased as that of predicting as well as the best forecaster 
from a benchmark set of strategies. This latter setting—often termed prediction of individual sequences, or online 
learning —is the focus of the present paper. 

We let the outcome $y_t$ and the prediction $\hat{y}_t$ take values in $\mathcal{Y} \subset \mathbb{R}$ and $\hat{\mathcal{Y}} \subset \mathbb{R}$, respectively. Formally, a deterministic prediction strategy is a mapping $(\mathcal{X} \times \mathcal{Y})^{t-1} \times \mathcal{X} \to \hat{\mathcal{Y}}$. We let the loss function $\ell : \hat{\mathcal{Y}} \times \mathcal{Y} \to \mathbb{R}$ score the quality of the prediction on a single round.

Assume that the time horizon $n \in \mathbb{Z}_+$ is known to the forecaster. The overall quality of the forecaster is then evaluated against the benchmark set of predictors, denoted as a class $\mathcal{F}$ of functions $\mathcal{X} \to \hat{\mathcal{Y}}$. The cumulative regret of the forecaster on the sequence $(x_1, y_1), \ldots, (x_n, y_n)$ is defined as

$$\sum_{t=1}^{n} \ell(\hat{y}_t, y_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^{n} \ell(f(x_t), y_t). \tag{1}$$

The forecaster aims to keep the difference in (1) small for all sequences $(x_1, y_1), \ldots, (x_n, y_n)$.
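As a concrete illustration of the regret in (1), the following minimal sketch evaluates it for a finite comparison class. The function name `cumulative_regret`, the toy data, and the three constant experts are assumptions made for this example only; they do not come from the paper.

```python
def cumulative_regret(predictions, xs, ys, experts, loss):
    """Regret (1): forecaster's total loss minus the best total loss in the class."""
    forecaster_loss = sum(loss(p, y) for p, y in zip(predictions, ys))
    best_in_class = min(
        sum(loss(f(x), y) for x, y in zip(xs, ys)) for f in experts
    )
    return forecaster_loss - best_in_class


def square_loss(y_hat, y):
    return (y_hat - y) ** 2


# Toy data (hypothetical): alternating outcomes, three constant predictors.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 1.0, 0.0, 1.0]
experts = [lambda x: 0.0, lambda x: 1.0, lambda x: 0.5]

# A forecaster that always predicts 0.5 matches the best expert here.
predictions = [0.5, 0.5, 0.5, 0.5]
regret = cumulative_regret(predictions, xs, ys, experts, square_loss)
```

For an infinite class the infimum cannot be enumerated this way; the sketch is only meant to make the two terms of (1) tangible.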

The comparison class $\mathcal{F}$ encodes the prior belief about the family of predictors one expects to perform well. If a forecasting strategy guarantees small regret for all sequences, and if $\mathcal{F}$ is a good model for the sequences observed in reality, then the forecasting strategy will also perform well in terms of its cumulative error. In fact, we can take $\mathcal{F}$ to be a class of solutions (that is, forecasting strategies) to a set of probabilistic sources one would obtain by positing a generative model of data. By doing so, we are modeling solutions to the prediction problem rather than modeling the data-generating mechanism. We refer to [18, 21] for further discussions on this “duality” between the probabilistic and deterministic approaches.

^1 This paper builds upon the study of online regression with square loss, presented by the authors at the COLT 2014 conference.

To ensure that $\mathcal{F}$ captures the phenomenon of interest, we would like $\mathcal{F}$ to be large. However, increasing the “size” of $\mathcal{F}$ likely leads to larger regret, as the comparison term in (1) becomes smaller. On the other hand, decreasing the “size” of $\mathcal{F}$ makes the regret minimization task easier, yet the prediction method is less likely to be successful in practice. This dichotomy is an analogue of the bias-variance tradeoff commonly studied in statistics. A contribution of this paper is an analysis of the growth of regret (with $n$) in terms of various notions of complexity of $\mathcal{F}$. The task was already accomplished in [24] for the case of absolute loss $\ell(a, b) = |a - b|$. In the present paper we obtain optimal guarantees for convex Lipschitz losses under very general assumptions.

To give the reader a sense of the results of this paper, we state the following informal corollary. Let complexity of $\mathcal{F}$ be measured via sequential entropy at scale $\beta$, to be defined below. (For the reader familiar with covering numbers, this is a sequential analogue, introduced in [24], of the classical Koltchinskii-Pollard entropy.)

Corollary 1 (Informal). Suppose sequential entropy at scale $\beta$ behaves as $O(\beta^{-p})$, $p > 0$. Then optimal regret

• for prediction with absolute loss grows as $n^{1/2}$ if $p \in (0, 2)$, and as $n^{1-1/p}$ for $p > 2$;

• for prediction with square loss grows as $n^{1-2/(2+p)}$ if $p \in (0, 2)$, and as $n^{1-1/p}$ for $p > 2$.

Moreover, these rates have matching lower bounds, sometimes modulo a logarithmic factor.

The first part of this corollary is established in [24]. The second part requires new techniques that take advantage of the curvature of the loss function.

In an attempt to entice the reader, let us discuss two conclusions that can be drawn from Corollary 1. First, the rates of convergence match optimal rates for excess square loss in the realm of distribution-free Statistical Learning Theory with i.i.d. data, under the assumption on the behavior of empirical covering numbers [27]. Hence, in the absence of a gap between classical and sequential complexities (introduced later) the regression problems in the two seemingly different frameworks enjoy the same rates of convergence. A deeper understanding of this phenomenon is of great interest.

The second conclusion concerns the same optimal rate $n^{1-1/p}$ for both square and absolute loss for “rich” classes ($p > 2$). Informally, strong convexity of the loss does not affect the rate of convergence for such massive classes. A geometric explanation of this interesting phenomenon requires further investigation.

We finish this introduction with a note about the generality of the setting proposed so far. Suppose $\mathcal{X} = \bigcup_{t \le n} \mathcal{Y}^{t-1}$, the space of all histories of $\mathcal{Y}$-valued outcomes. Denoting $x_t = (y_1, \ldots, y_{t-1}) = y^{t-1}$, we may view each $f \in \mathcal{F}$ itself as a strategy that maps history $y^{t-1}$ to a prediction. Ensuring that $x_t$ is not arbitrary but consistent with history only makes the task of regret minimization easier; the analysis of this paper for this case follows along the same lines, but we omit the extra overhead of restrictions on the $x_t$'s and instead refer the reader to [14, 21].

The paper is organized as follows. Section 2 introduces the notation and then presents a brief overview of 
sequential complexities. Upper and lower bounds on minimax regret are established in Sections 3 and 4. We 
calculate minimax rates for various examples in Section 5. We then turn to the question of developing algorithms 
in Section 6. We first show that an algorithm based on the Rademacher relaxation is admissible (see [19]) and yields 
the rates derived in a non-constructive manner in the first part of the paper. We show that further relaxations in 
finite dimensional space lead to the famous Vovk-Azoury-Warmuth forecaster. We also derive a prediction method 
for finite class $\mathcal{F}$.

2 Preliminaries 

2.1 Assumptions and Definitions 

We assume that the set of outcomes $\mathcal{Y}$ is a bounded set, a restriction that can be removed by standard truncation arguments (see e.g. [12]). Let $\mathcal{X}$ be some set of covariates, and let $\mathcal{F}$ be a class of functions $\mathcal{X} \to \hat{\mathcal{Y}}$ for some $\hat{\mathcal{Y}} \subset \mathbb{R}$. Recall the protocol of the online prediction problem: On each round $t \in \{1, \ldots, n\}$, $x_t \in \mathcal{X}$ is revealed to the learner who subsequently makes a prediction $\hat{y}_t \in \hat{\mathcal{Y}}$. The response $y_t \in \mathcal{Y}$ is revealed after the prediction is made.

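The loss function $\ell(\cdot, y)$ is assumed to be convex. Let $\partial_a \ell(a, y)$ denote any element of the subdifferential set (with respect to first argument), and assume that

$$\sup_{a \in \hat{\mathcal{Y}},\, y \in \mathcal{Y}} |\partial_a \ell(a, y)| \le G < \infty.$$

We assume that for any distribution of $y$ supported on $\mathcal{Y}$, there is a minimizer of expected loss that is finite and belongs to $\hat{\mathcal{Y}}$:

$$\hat{\mathcal{Y}} \cap \operatorname*{argmin}_{\hat{y} \in \mathbb{R}} \mathbb{E}\,\ell(\hat{y}, y) \neq \emptyset.$$

Given a $y \in \mathcal{Y}$, the error of a linear expansion at $a$ to approximate function value at $b$ is denoted by

$$\Delta^{y}_{a,b} \triangleq \ell(b, y) - \left[\ell(a, y) + \partial_a \ell(a, y) \cdot (b - a)\right].$$

Let $\Delta : (\hat{\mathcal{Y}} - \hat{\mathcal{Y}}) \to \mathbb{R}_{\ge 0}$ be a function defined pointwise as

$$\Delta(x) = \inf_{a, b \in \hat{\mathcal{Y}},\, y \in \mathcal{Y} \ \text{s.t.}\ b - a = x} \Delta^{y}_{a,b}, \tag{2}$$

a lower bound on the residual for any two values separated by $x$. For instance, an easy calculation shows that $\Delta(x) = x^2$ for $\ell(\hat{y}, y) = (\hat{y} - y)^2$.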
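The "easy calculation" for the square loss can also be checked numerically. The sketch below approximates the infimum in (2) by brute force over a small grid; the grid, the helper names `linearization_residual` and `delta`, and the particular subgradient chosen for the absolute loss are assumptions of this example. For comparison it also evaluates the absolute loss, whose linearization residual can vanish.

```python
def linearization_residual(loss, subgrad, a, b, y):
    """Error of the linear expansion of loss(., y) at a, evaluated at b."""
    return loss(b, y) - (loss(a, y) + subgrad(a, y) * (b - a))


def delta(loss, subgrad, x, grid):
    """Grid approximation of (2): smallest residual over pairs with b - a = x."""
    return min(
        linearization_residual(loss, subgrad, a, b, y)
        for a in grid
        for b in grid
        for y in grid
        if b - a == x
    )


# Grid of multiples of 1/4 in [-1, 1]; these are exact in floating point.
grid = [i / 4 for i in range(-4, 5)]

square = (lambda a, y: (a - y) ** 2, lambda a, y: 2 * (a - y))
# For the absolute loss we pick the subgradient sign(a - y), with 0 at a == y.
absolute = (lambda a, y: abs(a - y), lambda a, y: (a > y) - (a < y))

delta_square = delta(*square, 0.5, grid)      # residual is (b - a)^2 for every y
delta_absolute = delta(*absolute, 0.5, grid)  # residual vanishes when y lies outside [a, b]
```

On this grid the square loss gives a residual of $x^2$ regardless of $a$ and $y$, while the absolute loss gives zero, reflecting its lack of curvature.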

2.2 Minimax Formulation 

Unlike most previous approaches to the study of online regression, we do not start from an algorithm, but instead work directly with minimax regret. We will be able to extract a (not necessarily efficient) algorithm after obtaining upper bounds on the minimax value. Let us introduce the notation that makes the minimax regret definition more concise. We use $\langle\!\langle \cdots \rangle\!\rangle_{t=1}^{n}$ to denote an interleaved application of the operators, repeated over $t = 1, \ldots, n$ rounds. With this notation, the minimax regret of the online regression problem described earlier can be written as

$$\mathcal{V}_n = \Big\langle\!\!\Big\langle \sup_{x_t} \inf_{\hat{y}_t} \sup_{y_t} \Big\rangle\!\!\Big\rangle_{t=1}^{n} \left[ \sum_{t=1}^{n} \ell(\hat{y}_t, y_t) - \inf_{f \in \mathcal{F}} \sum_{t=1}^{n} \ell(f(x_t), y_t) \right] \tag{3}$$

where each $x_t$ ranges over $\mathcal{X}$, $\hat{y}_t$ ranges over $\hat{\mathcal{Y}}$, and $y_t$ ranges over $\mathcal{Y}$. An upper bound on $\mathcal{V}_n$ guarantees the existence of an algorithm (that is, a way to choose the $\hat{y}_t$'s) with at most that much regret against any sequence. A lower bound on $\mathcal{V}_n$, in turn, guarantees the existence of a sequence on which no method can perform better than the given lower bound.

2.3 Sequential Complexities 

One of the key tools in the study of estimators based on i.i.d. data is the symmetrization technique [13]. By introducing Rademacher random variables, one can study the supremum of an empirical process conditionally on the data. Conditioning facilitates the introduction of sample-based complexities of a function class, such as an empirical covering number. For a class of bounded functions, the covering number with respect to the empirical metric is necessarily finite and leads to a correct control of the empirical process even if discretization of the function class in a data-independent manner is impossible. We will return to this point when comparing our approach with discretization-based methods.

In the online prediction scenario, symmetrization is more subtle and involves the notion of a binary tree. The binary tree is, in some sense, the smallest entity that captures the sequential nature of the problem. More precisely, a $\mathcal{Z}$-valued tree $\mathbf{z}$ of depth $n$ is a complete rooted binary tree with nodes labeled by elements of a set $\mathcal{Z}$. Equivalently, we think of $\mathbf{z}$ as $n$ labeling functions, where $z_1$ is a constant label for the root, $z_2(-1), z_2(+1) \in \mathcal{Z}$ are the labels for the left and right children of the root, and so forth. Hence, for $\epsilon = (\epsilon_1, \ldots, \epsilon_n) \in \{\pm 1\}^n$, $z_t(\epsilon) = z_t(\epsilon_1, \ldots, \epsilon_{t-1}) \in \mathcal{Z}$ is the label of the node on the $t$-th level of the tree obtained by following the path $\epsilon$. For a function $g : \mathcal{Z} \to \mathbb{R}$, $g(\mathbf{z})$ is an $\mathbb{R}$-valued tree with labeling functions $g \circ z_t$ for level $t$ (or, in plain words, evaluation of $g$ on $\mathbf{z}$).

We now define two tree-based complexity notions of a class of functions. 

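Definition 1 ([24]). Sequential Rademacher complexity of a class $\mathcal{F} \subseteq \mathbb{R}^{\mathcal{X}}$ on a given $\mathcal{X}$-valued tree $\mathbf{x}$ of depth $n$, as well as its supremum, are defined as

$$\mathfrak{R}_n(\mathcal{F}; \mathbf{x}) = \mathbb{E} \sup_{f \in \mathcal{F}} \left[ \sum_{t=1}^{n} \epsilon_t f(x_t(\epsilon)) \right], \qquad \mathfrak{R}_n(\mathcal{F}) = \sup_{\mathbf{x}} \mathfrak{R}_n(\mathcal{F}; \mathbf{x}) \tag{4}$$

where the expectation is over a sequence of independent Rademacher random variables $\epsilon = (\epsilon_1, \ldots, \epsilon_n)$.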
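For a finite class and a small depth, the expectation in (4) can be computed exactly by enumerating all $2^n$ sign paths. The sketch below is only an illustration under assumptions of our own: a tree is represented as a function of the sign prefix $(\epsilon_1, \ldots, \epsilon_{t-1})$, and the names `sequential_rademacher` and `walk_tree` are hypothetical.

```python
from itertools import product


def sequential_rademacher(functions, tree, n):
    """Exact value of (4) on one tree, by enumerating all sign paths.

    `tree(prefix)` returns the label x_t(eps) given prefix = (eps_1, ..., eps_{t-1}).
    """
    total = 0.0
    for eps in product((-1, 1), repeat=n):
        total += max(
            sum(eps[t] * f(tree(eps[:t])) for t in range(n))
            for f in functions
        )
    return total / 2 ** n


# Hypothetical tree: the label at each node is the running sum of past signs.
def walk_tree(prefix):
    return sum(prefix)


# Two constant functions +1 and -1: the supremum equals |eps_1 + ... + eps_n|.
constants = [lambda x: 1.0, lambda x: -1.0]
value = sequential_rademacher(constants, walk_tree, 2)
```

The supremum over trees in (4) is not computed here; the code evaluates the complexity on one given tree only, and its cost grows as $2^n$.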

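One may think of the functions $x_1, \ldots, x_n$ as a predictable process with respect to the dyadic filtration $\{\sigma(\epsilon_1, \ldots, \epsilon_t)\}_{t \ge 1}$. The following notion of a $\beta$-cover quantifies complexity of the class $\mathcal{F}$ evaluated on the predictable process.

Definition 2 ([24]). A set $V$ of $\mathbb{R}$-valued trees of depth $n$ forms a $\beta$-cover (with respect to the $\ell_q$ norm) of a function class $\mathcal{F} \subseteq \mathbb{R}^{\mathcal{X}}$ on a given $\mathcal{X}$-valued tree $\mathbf{x}$ of depth $n$ if

$$\forall f \in \mathcal{F},\ \forall \epsilon \in \{\pm 1\}^n,\ \exists \mathbf{v} \in V \ \text{s.t.}\quad \frac{1}{n} \sum_{t=1}^{n} |f(x_t(\epsilon)) - v_t(\epsilon)|^q \le \beta^q.$$

A $\beta$-cover in the $\ell_\infty$ sense requires that $|f(x_t(\epsilon)) - v_t(\epsilon)| \le \beta$ for all $t \in [n]$. The size of the smallest $\beta$-cover is denoted by $\mathcal{N}_q(\beta, \mathcal{F}, \mathbf{x})$, and $\mathcal{N}_q(\beta, \mathcal{F}, n) = \sup_{\mathbf{x}} \mathcal{N}_q(\beta, \mathcal{F}, \mathbf{x})$.

We will refer to $\log \mathcal{N}_q(\beta, \mathcal{F}, n)$ as sequential entropy of $\mathcal{F}$. In particular, we will study the behavior of $\mathcal{V}_n$ when sequential entropy grows polynomially^2 as the scale $\beta$ decreases:

$$\log \mathcal{N}_2(\beta, \mathcal{F}, n) \sim \beta^{-p}, \quad p > 0. \tag{5}$$

We also consider the parametric “$p = 0$” case when sequential covering itself behaves as

$$\mathcal{N}_2(\beta, \mathcal{F}, n) \sim \beta^{-d} \tag{6}$$

(e.g. linear regression in a bounded set in $\mathbb{R}^d$). We remark that the $\ell_\infty$ cover is necessarily $n$-dependent, so the forms we assume for nonparametric and parametric cases, respectively, are

$$\log \mathcal{N}_\infty(\beta, \mathcal{F}, n) \sim \beta^{-p} \log(n/\beta) \quad \text{or} \quad \mathcal{N}_\infty(\beta, \mathcal{F}, n) \sim (n/\beta)^d. \tag{7}$$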


3 Upper Bounds 

The following theorem from [24] shows the importance of sequential Rademacher complexity for prediction with absolute loss.

Theorem 2 ([24]). Let $\mathcal{Y} = [-1, 1]$, $\mathcal{F} \subseteq [-1, 1]^{\mathcal{X}}$, and $\ell(\hat{y}, y) = |\hat{y} - y|$. It then holds that

$$\mathfrak{R}_n(\mathcal{F}) \le \mathcal{V}_n \le 2\,\mathfrak{R}_n(\mathcal{F}).$$

Furthermore, an upper bound of $2G\,\mathfrak{R}_n(\mathcal{F})$ holds for any $G$-Lipschitz loss. We observe, however, that as soon as $\mathcal{F}$ contains two distinct functions, sequential Rademacher complexity of $\mathcal{F}$ scales as $\Omega(n^{1/2})$. Yet, it is known that minimax regret for prediction with square loss grows slower than this rate. Therefore, the direct analysis based on sequential Rademacher complexity (and a contraction lemma) gives loose upper bounds on minimax regret. The key contribution of this paper is an introduction of an offset Rademacher complexity that captures the correct behavior.

In the next lemma, we show that minimax value of the sequential prediction problem with any convex Lipschitz loss function can be controlled via offset sequential Rademacher complexity. As before, let $\epsilon = (\epsilon_1, \ldots, \epsilon_n)$ where each $\epsilon_i$ is an independent Rademacher random variable.

^2 It is straightforward to allow constants in this definition, and we leave these details out for the sake of simplicity.
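Lemma 3. Under the assumptions and definitions in Section 2.1, the minimax rate is bounded by

$$\mathcal{V}_n \le \sup_{\mathbf{x}, \boldsymbol{\mu}} \mathbb{E} \sup_{f \in \mathcal{F}} \left[ \sum_{t=1}^{n} 2G\epsilon_t \big(f(x_t(\epsilon)) - \mu_t(\epsilon)\big) - \Delta\big(f(x_t(\epsilon)) - \mu_t(\epsilon)\big) \right] \tag{8}$$

where $\mathbf{x}$ and $\boldsymbol{\mu}$ range over all $\mathcal{X}$-valued and $\hat{\mathcal{Y}}$-valued trees of depth $n$, respectively.

The right-hand side of (8) will be termed offset Rademacher complexity of a function class $\mathcal{F} \subseteq \hat{\mathcal{Y}}^{\mathcal{X}}$ with respect to a convex even offset function $\Delta : \mathbb{R} \to \mathbb{R}_{\ge 0}$ and a mean $\mathbb{R}$-valued tree $\boldsymbol{\mu}$. If $\Delta \equiv 0$, we recover the notion of sequential Rademacher complexity since $\mathbb{E}[\epsilon_t \mu_t(\epsilon)] = 0$.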

A matching lower bound on the minimax value will be presented in Section 4, and the two results warrant a further study of offset Rademacher complexity. To this end, a natural next question is whether the chaining technique can be employed to control the supremum of this modified stochastic process. As a point of comparison, we first recall that sequential Rademacher complexity of a class $\mathcal{G}$ of $[-1, 1]$-valued functions on $\mathcal{Z}$ can be upper bounded via the Dudley integral-type bound

$$\mathbb{E} \sup_{g \in \mathcal{G}} \left[ \sum_{t=1}^{n} \epsilon_t g(z_t(\epsilon)) \right] \le \inf_{\rho \in (0, 1]} \left\{ 4\rho n + 12\sqrt{n} \int_{\rho}^{1} \sqrt{\log \mathcal{N}_2(\delta, \mathcal{G}, \mathbf{z})}\, d\delta \right\}, \tag{9}$$

for any $\mathcal{Z}$-valued tree $\mathbf{z}$ of depth $n$, as shown in [26]. We aim to obtain tighter upper bounds on the offset Rademacher complexity by taking advantage of the negative offset term.

To initiate the study of offset Rademacher complexity with functions $\Delta$ other than quadratic, we recall the notion of a convex conjugate.

Definition 3. For a convex function $\psi : \mathcal{D} \to \mathbb{R}$ with domain $\mathcal{D} \subseteq \mathbb{R}$, the convex conjugate $\psi^* : \mathbb{R} \to \mathbb{R} \cup \{+\infty\}$ is defined as

$$\psi^*(a) = \sup_{d \in \mathcal{D}} \{ad - \psi(d)\}.$$

The chaining technique for controlling a supremum of a stochastic process requires a statement about the behavior of the process over a finite collection. The next lemma provides such a statement for the offset Rademacher process.

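Lemma 4. Let $\Delta$ be a convex, nonnegative, even function on $\mathbb{R}$ and let $\Gamma^*$ denote the convex conjugate of the function $x \mapsto \Delta(\sqrt{x})$. Assume $\Gamma^*$ is nondecreasing. For any finite set $W$ of $\mathbb{R}$-valued trees of depth $n$ and any constant $C > 0$,

$$\mathbb{E} \max_{\mathbf{w} \in W} \left[ \sum_{t=1}^{n} 2C\epsilon_t w_t(\epsilon) - \Delta(w_t(\epsilon)) \right] \le \inf_{\lambda > 0} \left\{ \frac{1}{\lambda} \log|W| + n\,\Gamma^*(2C^2\lambda) \right\}. \tag{10}$$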

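Further, for any $[-G, G]$-valued tree $\boldsymbol{\eta}$,

$$\mathbb{E} \max_{\mathbf{w} \in W} \left[ \sum_{t=1}^{n} \epsilon_t \eta_t(\epsilon) w_t(\epsilon) \right] \le G \sqrt{2 \log|W| \cdot \max_{\mathbf{w} \in W,\, \epsilon \in \{\pm 1\}^n} \sum_{t=1}^{n} w_t(\epsilon)^2}. \tag{11}$$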


As an example, if $\Delta(x) = x^2$, an easy calculation shows that $\Gamma^*(1) = 0$ and $\Gamma^*(y) = +\infty$ for any $y > 1$. Hence, the infimum in (10) is achieved at $\lambda = 1/(2C^2)$, and the upper bound becomes $2C^2 \log|W|$.
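The quadratic case can be sanity-checked by brute force. The sketch below restricts attention to constant trees (plain sequences, which are a special case of trees), enumerates all sign patterns, and compares the expected maximum of the offset process with $2C^2 \log|W|$. The set `W`, the value of `C`, and the function name are assumptions chosen for illustration.

```python
import math
from itertools import product


def offset_process_expectation(W, C):
    """E max_w sum_t (2*C*eps_t*w_t - w_t**2), for sequences w (constant trees)."""
    n = len(W[0])
    total = 0.0
    for eps in product((-1, 1), repeat=n):
        total += max(
            sum(2 * C * e * w_t - w_t ** 2 for e, w_t in zip(eps, w))
            for w in W
        )
    return total / 2 ** n


# Hypothetical finite collection of four sequences of length 4.
W = [
    (0.0, 0.0, 0.0, 0.0),
    (1.0, 1.0, 1.0, 1.0),
    (-1.0, -1.0, -1.0, -1.0),
    (1.0, -1.0, 1.0, -1.0),
]
C = 1.0

lhs = offset_process_expectation(W, C)
rhs = 2 * C ** 2 * math.log(len(W))  # bound from the example following Lemma 4
```

The negative quadratic term is what keeps the left-hand side bounded independently of the depth $n$; without it, the expected maximum would grow with $n$.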

We can now employ the chaining technique to extend the control of the stochastic process beyond the finite 
collection. 


Lemma 5. Let $\Delta$ and $\Gamma^*$ be as in Lemma 4. For any $\mathcal{Z}$-valued tree $\mathbf{z}$ of depth $n$ and a class $\mathcal{G}$ of functions $\mathcal{Z} \to \mathbb{R}$ and any constant $C > 0$,

n 

Esup ^2Ccfg(Zf(c))-A(g(zt(c))) 

ge® f=l 

-I- inf M log^A^o (|,^^,z) -I- u F* (2C^A) 

A>o I 'f 




Remark 1. For the case of $\Delta(x) = x^2$, it is possible to prove the upper bound of Lemma 5 in terms of $\ell_2$ sequential covering numbers rather than $\ell_\infty$ (see [22]).

Lemma 5, together with Lemma 3, yields upper bounds on minimax regret under assumptions on the growth of sequential entropy. Before detailing the rates, we present lower bounds on the minimax value in terms of the offset Rademacher complexity and combinatorial dimensions.


4 Lower Bounds 

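The function $\Delta$, arising from uniform (or strong) convexity of the loss function, enters the upper bounds on minimax regret. For proving lower bounds, we consider the dual property, that of (restricted) smoothness. To this end, let $S \subseteq \hat{\mathcal{Y}}$ be a subset satisfying the following condition:

$$\forall s \in S,\ \exists y_1(s), y_2(s) \in \mathcal{Y} \ \text{s.t.}\quad s \in \operatorname*{argmin}_{\hat{y} \in \hat{\mathcal{Y}}} \tfrac{1}{2}\big(\ell(\hat{y}, y_1(s)) + \ell(\hat{y}, y_2(s))\big). \tag{12}$$

For any such subset $S$, let $\bar{\Delta}_S : (\hat{\mathcal{Y}} - S) \to \mathbb{R}_{\ge 0}$ be defined as

$$\bar{\Delta}_S(x) = \sup_{s \in S,\, b \in \hat{\mathcal{Y}} \ \text{s.t.}\ b - s = x} \max\left\{ \Delta^{y_1(s)}_{s,b},\ \Delta^{y_2(s)}_{s,b} \right\}. \tag{13}$$

We write $\bar{\Delta}_\kappa$ for the singleton set $S = \{\kappa\}$.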

The lower bounds in this section will be constructed from symmetric distributions supported on two carefully 
chosen points. Crucially, we do not require a uniform notion of smoothness, but rather a condition on the loss that 
holds for a restricted subset S and a two-point distribution. 

As an example, consider square loss and '3^ = 3^ = For any se%£ , we may choose the two points as 

s + 5 eW, for small enough 6 , with the desired property. Then S-‘d/ and Aslx) = y?. 

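Lemma 6. Fix $R > 0$. Suppose $S$ satisfies condition (12), and suppose that for any $s \in S$,

$$\partial \ell(s, y_1(s)) = +R, \qquad \partial \ell(s, y_2(s)) = -R.$$

Then for any $S$-valued tree $\boldsymbol{\mu}$ of depth $n$,

$$\mathcal{V}_n \ge \sup_{\mathbf{x}} \mathbb{E} \sup_{f \in \mathcal{F}} \left[ \sum_{t=1}^{n} \epsilon_t R \big(f(x_t(\epsilon)) - \mu_t(\epsilon)\big) - \bar{\Delta}_{\mu_t(\epsilon)}\big(f(x_t(\epsilon)) - \mu_t(\epsilon)\big) \right] \tag{14}$$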


The lower bound in (14) is an offset Rademacher complexity that matches the upper bound of Lemma 3 up to constants, as long as functions $\Delta$ and $\bar{\Delta}$ exhibit the same behavior. In particular, the upper and lower bounds match up to a constant for the case of square loss.

Our next step is to quantify the lower bound in terms of $n$ according to “size” of $\mathcal{F}$. In contrast to the more common statistical approaches based on covering numbers and Fano inequality, we turn to a notion of a combinatorial dimension as the main tool.


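Definition 4. An $\mathcal{X}$-valued tree $\mathbf{x}$ of depth $d$ is said to be $\beta$-shattered by $\mathcal{F}$ if there exists an $\mathbb{R}$-valued tree $\mathbf{s}$ of depth $d$ such that

$$\forall \epsilon \in \{\pm 1\}^d,\ \exists f \in \mathcal{F} \ \text{s.t.}\quad \epsilon_t \big(f(x_t(\epsilon)) - s_t(\epsilon)\big) \ge \beta/2$$

for all $t \in \{1, \ldots, d\}$. The tree $\mathbf{s}$ is called a witness. The largest $d$ for which there exists a $\beta$-shattered $\mathcal{X}$-valued tree is called the (sequential) fat-shattering dimension, denoted by $\mathrm{fat}_\beta(\mathcal{F})$.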

The reader will notice that the upper bound of Lemma 5 is in terms of sequential entropies rather than combinatorial dimensions. The two notions, however, are closely related.

Theorem 7 ([26]). Let $\mathcal{F}$ be a class of functions $\mathcal{X} \to [-1, 1]$. For any $\beta > 0$,

$$\mathcal{N}_2(\beta, \mathcal{F}, n) \le \mathcal{N}_\infty(\beta, \mathcal{F}, n) \le \left(\frac{2en}{\beta}\right)^{\mathrm{fat}_\beta(\mathcal{F})}.$$

As a consequence of the above theorem, if $\log \mathcal{N}_2(\beta, \mathcal{F}, n) \ge (c/\beta)^p$ and $\beta > 1/n$, then $\mathrm{fat}_\beta(\mathcal{F}) \ge (c'/\beta)^p / \log(n)$, where $c, c'$ may depend on the range of functions in $\mathcal{F}$.

The lower bounds will now be obtained assuming $\mathrm{fat}_\beta(\mathcal{F}) \ge \beta^{-p}$ behavior of the fat-shattering dimension, and the corresponding statements in terms of the sequential entropy growth will involve extra logarithmic factors, hidden in the $\tilde{\Omega}(\cdot)$ notation.

Lemma 8. Suppose the statement of Lemma 6 holds for some $R > 0$, and suppose

Afi^loiPtie) - fix,{£))) < ^Ipfie)-fixtie))\ (15) 

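for any $f$ and $\boldsymbol{\mu}$, $\mathbf{x}$ in the statement of Lemma 6. Then it holds that for any $\beta > 0$ and $n = \mathrm{fat}_\beta(\mathcal{F})$,

$$\mathcal{V}_n \ge (R/2)\, n \beta.$$

In particular, if $\mathrm{fat}_\beta(\mathcal{F}) \ge \beta^{-p}$ for $p > 0$, we have

$$\frac{1}{n} \mathcal{V}_n \ge (R/2)\, n^{-1/p}.$$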

As an example, consider the case of square loss with $\mathcal{Y} = [-B, B]$. Then we may take $S = \{0\}$, $y_1 = B$, $y_2 = -B$, and hence $R = 2B$. We verify that (15) holds for $\hat{\mathcal{Y}} = [-B/2, B/2]$.

Lemma 9. Suppose the statement of Lemma 6 holds for some R > 0. For any class and f > 0, there exists a 
modified class ^ such that for all f < f, fat^/(,^') < fat^;(,^) < 2fat^/(,^') -i- 4 and for n > fatpi^), 


1 

-V„> sup 

tt R_k 


2 


fatpi^) 

2 n 


- Ax 


Armed with the upper bounds of Section 3 and the lower bounds of Section 4, we are ready to detail specific 
minimax rates of convergence for various classes of regression functions and a range of loss functions $\ell$.


5 Minimax Rates 


Combining Lemma 3 and Lemma 5, we can detail the behavior of minimax regret under an assumption about the 
growth rate of sequential entropy. 

Theorem 10. Let $r \ge 2$, $p > 0$ and suppose the loss function and the function class are such that

$$\Delta(t) \ge K t^r, \qquad \log \mathcal{N}_\infty(\beta, \mathcal{F}, n) \le \beta^{-p} \log(n/\beta).$$

Then for $p \in (0, 2)$,

— Vn< min I Cr,p n“ 2 (r-i)+p Q 2 i,r-i)+p 2 ir-i)+p login) , c^Glog^^^(n)n“^^^|. (16) 

and for p>2 

-Vn<CpGlog^'^in)n-^‘P (17) 

n 

Here, $C_{r,p}$ depends on $\sup_{f \in \mathcal{F}} \|f\|_\infty$. At $p = 2$, the bound (17) gains an extra $\log(n)$ factor.
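To see how the curvature exponent $r$ and the entropy exponent $p$ interact, the following sketch tabulates the power of $n$ in the normalized regret $\frac{1}{n}\mathcal{V}_n$, ignoring constants and logarithmic factors. It takes the power in the curved regime to be $r/(2(r-1)+p)$ and the power for rich classes to be $1/p$; the function name and the convention `r=None` for a loss without curvature are assumptions of this example.

```python
def regret_rate_exponent(p, r=None):
    """Exponent a with V_n / n ~ n^{-a}, up to constants and log factors.

    p : growth exponent of sequential entropy (p > 0).
    r : curvature exponent of the loss; None means no curvature (e.g. absolute loss).
    """
    if p <= 0:
        raise ValueError("p must be positive")
    if p >= 2:
        # Rich classes: curvature of the loss no longer helps.
        return 1.0 / p
    if r is None:
        return 0.5
    # The smaller of the two bounds corresponds to the larger exponent.
    return max(r / (2 * (r - 1) + p), 0.5)


square_loss_rate = regret_rate_exponent(1.0, r=2.0)   # 2 / (2 + p)
absolute_loss_rate = regret_rate_exponent(1.0)        # 1 / 2
rich_class_rate = regret_rate_exponent(4.0, r=2.0)    # 1 / p
```

As $r$ grows the curved exponent decreases to $1/2$, so very weakly curved losses behave like the absolute loss.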
We match the above upper bounds with lower bounds under the assumption on the growth of the combinatorial dimension.

Theorem 11. Suppose the statement of Lemma 6 holds for some $R > 0$ and $\kappa \in S$. Let $r \ge 2$, $p \in (0, 2)$, and assume

$$\bar{\Delta}_\kappa(\beta/2) \le K\beta^r, \qquad \mathrm{fat}_\beta \ge \beta^{-p}.$$

Then there exists a function class $\mathcal{F}$ such that for some constant $C_{p,r} > 0$,

$$\frac{1}{n} \mathcal{V}_n \ge C_{p,r} \min\left\{ n^{-\frac{r}{2(r-1)+p}}\, R^{\frac{2r}{2(r-1)+p}}\, K^{-\frac{2-p}{2(r-1)+p}},\ R\, n^{-1/2} \right\}$$

for $p \in (0, 2)$. Furthermore, for $p > 2$, for any $\mathcal{F}$ with $\mathrm{fat}_\beta \ge \beta^{-p}$,

$$\frac{1}{n} \mathcal{V}_n \ge (R/2)\, n^{-1/p}$$

under the assumption (15).

The lower bound of Theorem 11 matches (up to polylogarithmic in n factors) the upper bound of Theorem 10 
in its dependence on n, the dependence on the constant K, and in dependence on the size of the gradients G 
(respectively, R). The rest of this section is devoted to the discussion of the derived upper and lower bounds for 
particular loss functions or particular classes of functions. 

5.1 Absolute loss 

We verify that the general statements recover the correct rates for the case of $\ell(\hat{y}, y) = |\hat{y} - y|$. Since the absolute loss is not strongly convex, we take $K = 0$ (and $\Delta \equiv 0$). Theorem 10 then yields the rate $O(n^{-1/2})$ for $p \in (0, 2)$ and $O(n^{-1/p})$ for $p > 2$, up to logarithmic factors. These rates are matched, again up to logarithmic factors, in Theorem 11. Of course, the result already follows from Theorem 2.

It is also instructive to check the case of $r \to \infty$. In this case, if $K$ is scaled properly by the range of function values, the function $\Delta$ approaches the zero function, indicating absence of strong convexity of the loss. Examining the power $\frac{r}{2(r-1)+p}$ in Theorem 10, we see that it approaches $1/2$, matching the discussion of the preceding paragraph.

5.2 Square loss 

The case of square loss $\ell(\hat{y}, y) = (\hat{y} - y)^2$ has been studied in [22]. In view of Remark 1, we state the corollary below in terms of $\ell_2$ covering numbers, thus removing some logarithmic terms of Theorem 10.

Corollary 12. For a class $\mathcal{F}$ with sequential entropy growth $\log \mathcal{N}_2(\beta, \mathcal{F}, n) \le \beta^{-p}$,

• For $p > 2$, the minimax regret^3 is bounded as $\frac{1}{n}\mathcal{V}_n \le C n^{-1/p}$

• For $p \in (0, 2]$, the minimax regret is bounded as $\frac{1}{n}\mathcal{V}_n \le C n^{-2/(2+p)}$

• For the parametric case (6), $\frac{1}{n}\mathcal{V}_n \le C d\, n^{-1} \log(n)$

• For finite set $\mathcal{F}$, $\frac{1}{n}\mathcal{V}_n \le C n^{-1} \log|\mathcal{F}|$

Corollary 13. The upper bounds of Corollary 12 are tight^4:

• For $p > 2$, for any class $\mathcal{F}$ of uniformly bounded functions with a lower bound of $\beta^{-p}$ on sequential entropy growth, $\frac{1}{n}\mathcal{V}_n \ge \tilde{\Omega}(n^{-1/p})$

• For $p \in (0, 2]$, for any class $\mathcal{F}$ of uniformly bounded functions, there exists a slightly modified class $\mathcal{F}'$ with the same sequential entropy growth such that $\frac{1}{n}\mathcal{V}_n \ge \tilde{\Omega}(n^{-2/(2+p)})$

• There exists a class $\mathcal{F}$ with the covering number as in (6), such that $\frac{1}{n}\mathcal{V}_n \ge \Omega(d\, n^{-1} \log(n))$

^3 For $p = 2$, $\frac{1}{n}\mathcal{V}_n \le C \log(n)\, n^{-1/2}$.

^4 The $\tilde{\Omega}(\cdot)$ notation suppresses logarithmic factors.
5.3 $\ell_q$-loss for $q \in (1, 2)$

Consider the case of $\ell(\hat{y}, y) = |\hat{y} - y|^q$, for $q \in (1, 2)$, which interpolates between the absolute value and square losses.

Corollary 14. Suppose $\hat{\mathcal{Y}} = \mathcal{Y} = [-1, 1]$ and $\ell(\hat{y}, y) = |\hat{y} - y|^q$ for $q \in (1, 2)$. Assume complexity of $\mathcal{F}$ as in Theorems 10 and 11 for some $p > 0$. Then

— Vfi = 0|min|(^- l)~^ 

5.4 $\ell_q$-loss for $q > 2$

It is easy to check that for $q > 2$, $\ell(\hat{y}, y) = |\hat{y} - y|^q$ is $q$-uniformly convex, and thus

$$\Delta(t) \ge C t^q.$$


The upper bound of 


then follows from Theorem 10. 


1 __ 2 _ 
— Vn<Cn 2(?-i)+p 
n 


5.5 Logistic loss 

The loss function $\ell(\hat{y}, y) = \log(1 + \exp\{-\hat{y}y\})$ is strongly convex and smooth if the sets $\hat{\mathcal{Y}}, \mathcal{Y}$ are bounded. This can be seen by computing the second derivative with respect to the first argument:

$$\ell''(\hat{y}, y) = y^2\, \frac{\exp\{\hat{y}y\}}{(1 + \exp\{\hat{y}y\})^2}$$
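The second-derivative formula can be checked against a finite difference, and its minimum over a bounded prediction range gives an explicit curvature constant. The sketch below assumes labels $y \in \{-1, +1\}$ and predictions in $[-B, B]$; these choices, the step size `h`, and all function names are assumptions made for illustration. Note that the formula vanishes at $y = 0$, so the lower bound computed here relies on the labels being bounded away from zero.

```python
import math


def logistic_loss(y_hat, y):
    return math.log1p(math.exp(-y_hat * y))


def logistic_second_derivative(y_hat, y):
    """Second derivative in y_hat: y^2 * exp(y_hat*y) / (1 + exp(y_hat*y))^2."""
    e = math.exp(y_hat * y)
    return y * y * e / (1.0 + e) ** 2


def finite_difference_second(y_hat, y, h=1e-4):
    """Central finite-difference approximation of the second derivative."""
    return (
        logistic_loss(y_hat + h, y)
        - 2.0 * logistic_loss(y_hat, y)
        + logistic_loss(y_hat - h, y)
    ) / h ** 2


def curvature_lower_bound(B):
    """Smallest second derivative over |y_hat| <= B when y is +1 or -1."""
    return math.exp(B) / (1.0 + math.exp(B)) ** 2


B = 2.0
grid = [-B + 2 * B * i / 40 for i in range(41)]
min_curvature = min(
    logistic_second_derivative(y_hat, y) for y_hat in grid for y in (-1.0, 1.0)
)
```

The curvature constant decays exponentially in $B$, which is why boundedness of the prediction set matters for the strong convexity claim.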


We conclude that 

— Vn = 0|min|?i~^, 

Logistic loss is an example of a function with third derivative bounded by a multiple of the second derivative. 
Control of the remainder term in Taylor approximation for such functions is given in [5, Lemma 1]. Other examples 
of strongly convex and smooth losses are the exponential loss and truncated quadratic loss. These enjoy the same 
minimax rate as given above. 


5.6 Logarithmic loss 

The technique developed in this paper is not universal. In particular, it does not yield correct rates for rich classes of functions under the loss

$$\ell(\hat{y}, y) = -\log(\hat{y})\, \mathbf{1}\{y = 1\} - \log(1 - \hat{y})\, \mathbf{1}\{y = 0\}$$

for the problem of probability assignment and a binary alphabet $\mathcal{Y} = \{0, 1\}$. The suboptimality of Lemma 3 is due to the exploding Lipschitz constant. However, a modified approach is possible, and will be carried out in a separate paper.


5.7 Sparse linear predictors and square loss 

We now focus on quadratic loss and instead detail minimax rates for specific classes of functions. Consider the following parametric class. Let $\mathcal{G} = \{g_1, \ldots, g_M\}$ be a set of $M$ functions such that each $g_i : \mathcal{X} \to [-1, 1]$. Define $\mathcal{F}$ to be the convex combination of at most $s$ out of these $M$ functions. That is

$$\mathcal{F} = \left\{ \sum_{j=1}^{s} \alpha_j g_{\sigma_j} \ :\ \sigma_{1:s} \subseteq [M],\ \forall j,\ \alpha_j \ge 0,\ \sum_{j=1}^{s} \alpha_j = 1 \right\}$$

For this example note that the sequential covering number can be easily upper bounded: we can choose $s$ out of $M$
functions in $\binom{M}{s}$ ways and observe that pointwise metric entropy for convex combination of $s$ bounded functions
at scale jS is bounded as We conclude that 

leMV , 

From the main theorem, for the case of square loss, the upper bound is 

The extension to other loss functions follows immediately from the general statements. 

5.8 Besov spaces and square loss 

Let $\mathcal{X}$ be a compact subset of $\mathbb{R}^d$. Let $\mathcal{F}$ be a ball in Besov space $B^{s}_{p,q}(\mathcal{X})$. When $s > d/p$, pointwise metric entropy at scale $\beta$ scales as $\Omega(\beta^{-d/s})$ [31, p. 20]. On the other hand, when $s < d/p$, and $p > 2$, one can show that the space is a $p$-uniformly convex Banach space. From [26], it can be shown that sequential Rademacher complexity can be upper bounded by $O(n^{1-1/p})$, yielding a bound on minimax rate. These two controls together give the bound on the minimax rate. The generic forecaster with Rademacher complexity as relaxation (see Section 6) enjoys the best of both of these rates. More specifically, we may identify the following regimes:

• If $s \ge d/2$, the minimax rate is $\frac{1}{n}\mathcal{V}_n \le O\big(n^{-\frac{2s}{2s+d}}\big)$.

• If $s < d/2$, the minimax rate depends on the interaction of $p$ and $d, s$:

- if $p > d/s$, the minimax rate is $\frac{1}{n}\mathcal{V}_n \le O\big(n^{-s/d}\big)$; otherwise, the rate is $\frac{1}{n}\mathcal{V}_n \le O\big(n^{-1/p}\big)$

5.9 Remarks: Experts, Mixability, and Discretization 

The problem of prediction with expert advice has been central in the online learning literature [9]. One can phrase the experts problem in our setting by taking a finite class $\mathcal{F} = \{f_1, \ldots, f_N\}$ of functions. It is possible to ensure sublinear regret by following the “advice” $f_{I_t}(x_t)$ of a randomly chosen “expert” $I_t$ from an appropriate distribution over experts. The randomized approach, however, effectively linearizes the problem and does not take advantage of the curvature of the loss. The precise way in which the loss enters the picture has been investigated thoroughly by Vovk [28] (see also [15]). Vovk defines a mixability curve that parametrizes achievable regret of a form slightly different than (1). Specifically, Vovk allows a constant other than 1 in front of the infimum in the regret definition. Such regret bounds are called “inexact oracle inequalities” in statistics. Audibert [2] shows that the mixability condition on the loss function leads to a variance-type bound in his general PAC-based formulation, yet the analysis is restricted to the case of finite experts. While it is possible to repeat the analysis in the present paper with a constant other than 1 in front of the comparator, this goes beyond the scope of the paper. Importantly, our techniques go beyond the finite case and can give correct regret bounds even if discretization to a finite set of experts yields vacuous bounds.

Let us emphasize the above point again by comparing the upper bound of Lemma 5 to the bound we may obtain via a metric entropy approach, as in the work of [31]. Assume that $\mathcal{F}$ is a compact subset of $C(\mathcal{X})$ equipped with supremum norm. The metric entropy, denoted by $\mathcal{H}(\epsilon, \mathcal{F})$, is the logarithm of the size of the smallest $\epsilon$-net with respect to the sup norm on $\mathcal{X}$. An aggregating procedure over the elements of the net gives an upper bound (omitting constants and logarithmic factors)

$$n\epsilon + \mathcal{H}(\epsilon, \mathcal{F}) \tag{18}$$

on regret (1). Here, $n\epsilon$ is the amount we lose from restricting the attention to the $\epsilon$-net, and the second term appears from aggregation over a finite set. The balance (18) fails to capture the optimal behavior for large nonparametric sets of functions. Indeed, for an $O(\epsilon^{-p})$ behavior of metric entropy, Vovk concludes the rate of $O\big(n^{\frac{p}{p+1}}\big)$. For $p < 2$, this is slower than the $O\big(n^{\frac{p}{p+2}}\big)$ rate one obtains from Lemma 5 by trivially upper bounding the sequential entropy by metric entropy. The gain is due to the chaining technique, a phenomenon well-known in statistical learning theory. Our contribution is to introduce the same concepts to the domain of online learning.
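The balance in (18) can be worked out explicitly when metric entropy equals $\epsilon^{-p}$: minimizing $n\epsilon + \epsilon^{-p}$ over $\epsilon$ gives $\epsilon = (p/n)^{1/(p+1)}$ and a regret of order $n^{p/(p+1)}$. The sketch below computes this balance and compares the resulting exponent of $n$ with the chaining exponent $p/(p+2)$. All function names are assumptions of this example, and constants and logarithmic factors are ignored.

```python
def balanced_discretization_bound(n, p):
    """Minimum over eps of n*eps + eps**(-p), attained at eps = (p/n)**(1/(p+1))."""
    eps = (p / n) ** (1.0 / (p + 1.0))
    return n * eps + eps ** (-p)


def discretization_exponent(p):
    """Power of n in the regret obtained from the balance (18)."""
    return p / (p + 1.0)


def chaining_exponent(p):
    """Power of n in the regret obtained via chaining, for p < 2."""
    return p / (p + 2.0)


bound = balanced_discretization_bound(10 ** 4, 1.0)  # eps = 0.01, both terms equal 100
gaps = [discretization_exponent(p) - chaining_exponent(p) for p in (0.5, 1.0, 1.5)]
```

For $p = 1$ the discretization approach gives regret of order $n^{1/2}$, while chaining gives $n^{1/3}$.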


6 Relaxations and Algorithms 

To design generic forecasters for the problem of online non-parametric regression we follow the recipe provided 
in [19]. It was shown in that paper that if one can find a relaxation Rel„ (a sequence of mappings from observed 
data to reals) that satisfies certain conditions, then one can define prediction strategies based on such relaxations. 
Specifically, we look for relaxations that satisfy the initial condition 

n 

Rel„[xi:„,yi-n]>- inf ^^(/(Xf),yf) 

t=i 

and the recursive admissibility condition that requires 


infsup{^(yf,yf) -t Rel„ (xi:f,yi:()} < Rel„ (xi:f_i,yi:t_i) (19) 

ft yt 

for any t e [n] and any Xf e .ST. A relaxation Rel„ satisfying these two conditions is said to be admissible, and it 
leads to an algorithm 


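$$\hat{y}_t = \operatorname*{argmin}_{\hat{y}\in\mathcal{Y}} \sup_{y_t\in\mathcal{Y}} \left\{ \ell(\hat{y}, y_t) + \mathrm{Rel}_n(x_{1:t}, y_{1:t}) \right\}. \tag{20}$$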


For this forecast the associated bound on regret is 

$$\mathrm{Reg}_n := \sum_{t=1}^n \ell(\hat{y}_t, y_t) - \inf_{f\in\mathcal{F}} \sum_{t=1}^n \ell(f(x_t), y_t) \le \mathrm{Rel}_n(\emptyset) \tag{21}$$


(see [19] for details). We now claim that the following conditional version of (8) gives an admissible relaxation and 
leads to a method that enjoys the regret bounds shown in the first part of the paper. 

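Lemma 15. The following relaxation is admissible:

$$\mathfrak{R}_n(x_{1:t}, y_{1:t}) = \sup_{\mathbf{x},\boldsymbol{\mu}} \mathbb{E}_\epsilon \sup_{f\in\mathcal{F}} \left[ \sum_{j=t+1}^n \Big( 2G\epsilon_j \big(f(\mathbf{x}_j(\epsilon)) - \boldsymbol{\mu}_j(\epsilon)\big) - \Delta\big(f(\mathbf{x}_j(\epsilon)) - \boldsymbol{\mu}_j(\epsilon)\big) \Big) - \sum_{j=1}^t \ell(f(x_j), y_j) \right].$$

The algorithm (20) with this relaxation enjoys the regret bound of offset Rademacher complexity

$$\mathrm{Reg}_n \le \sup_{\mathbf{x},\boldsymbol{\mu}} \mathbb{E}_\epsilon \sup_{f\in\mathcal{F}} \left[ \sum_{t=1}^n 2G\epsilon_t \big(f(\mathbf{x}_t(\epsilon)) - \boldsymbol{\mu}_t(\epsilon)\big) - \Delta\big(f(\mathbf{x}_t(\epsilon)) - \boldsymbol{\mu}_t(\epsilon)\big) \right].$$

The proof of Lemma 15 follows closely the proof of Lemma 3 and we omit it (see [19, 20]). Since the regret
bound for the above forecaster is exactly the one given in (8), the upper bounds in Corollary 12 hold for the above
algorithm. Therefore, the algorithm based on $\mathfrak{R}_n(x_{1:t}, y_{1:t})$ is optimal up to the tightness of the upper and lower
bounds in Section 4 and Section 3.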
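
To get a feel for the offset Rademacher complexity that bounds the regret here, the following sketch evaluates it exactly for a small finite class. It is an illustration only and makes simplifying assumptions that are not in the paper: the trees $\mathbf{x}$ and $\boldsymbol{\mu}$ are constant (they do not depend on $\epsilon$), no supremum over trees is taken, and the function name `offset_rademacher` and its arguments are hypothetical.

```python
import itertools
from typing import Callable, Sequence


def offset_rademacher(
    values: Sequence[Sequence[float]],
    mu: Sequence[float],
    G: float,
    delta: Callable[[float], float],
) -> float:
    """Exact E_eps max_f sum_t [2*G*eps_t*(f_t - mu_t) - delta(f_t - mu_t)].

    values[i][t] is the value of the i-th function at the t-th point.
    The points and the centers mu do not depend on eps (constant trees),
    which is a simplification of the tree-based definition.
    """
    n = len(mu)
    total = 0.0
    # Enumerate all 2**n sign patterns; only feasible for small n.
    for eps in itertools.product((-1.0, 1.0), repeat=n):
        total += max(
            sum(2.0 * G * e * (f_t - m) - delta(f_t - m)
                for e, f_t, m in zip(eps, f, mu))
            for f in values
        )
    return total / 2 ** n
```

With two constant functions $\pm 1$, one round, $G = 1$ and $\Delta(z) = z^2$, the offset term reduces the value from 2 (no offset) to 1, which shows how curvature lowers the complexity.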

For the rest of this section, we restrict our attention to the case when $\mathcal{Y} = [-B, B]$. We further assume that
$\ell(\hat{y}, y_t) + \mathrm{Rel}_n(x_{1:t}, (y_{1:t-1}, y_t))$ is a convex function of $y_t$. In this case, the prediction $\hat{y}_t$ takes a simple form, as the
supremum over $y_t$ is attained either at $B$ or $-B$. More precisely, the prediction can be written as

$$\hat{y}_t = \operatorname*{argmin}_{\hat{y}} \max\left\{ \ell(\hat{y}, B) + \mathrm{Rel}_n(x_{1:t}, (y_{1:t-1}, B)),\ \ell(\hat{y}, -B) + \mathrm{Rel}_n(x_{1:t}, (y_{1:t-1}, -B)) \right\}. \tag{22}$$
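
6.1 Recipe for designing online regression algorithms for general loss functions

We now provide a schema for deriving forecasters for general online non-parametric regression:

1. Find a relaxation $\mathrm{Rel}_n$ such that $\mathfrak{R}_n(x_{1:t}, y_{1:t}) \le \mathrm{Rel}_n(x_{1:t}, y_{1:t})$, and $\ell(\hat{y}, y_t) + \mathrm{Rel}_n(x_{1:t}, (y_{1:t-1}, y_t))$ is convex in $y_t$.

2. Check the condition

$$\sup_{x_t\in\mathcal{X},\ p_t\in\Delta([-B,B])} \left\{ \inf_{\hat{y}_t} \mathbb{E}_{y_t\sim p_t}\left[\ell(\hat{y}_t, y_t)\right] + \mathbb{E}_{y_t\sim p_t}\left[\mathrm{Rel}_n(x_{1:t}, y_{1:t})\right] \right\} \le \mathrm{Rel}_n(x_{1:t-1}, y_{1:t-1}).$$

3. Given $x_t$, the prediction $\hat{y}_t$ is given by

$$\hat{y}_t = \operatorname*{argmin}_{\hat{y}\in[-B,B]} \max\left\{ \ell(\hat{y}, B) + \mathrm{Rel}_n(x_{1:t}, (y_{1:t-1}, B)),\ \ell(\hat{y}, -B) + \mathrm{Rel}_n(x_{1:t}, (y_{1:t-1}, -B)) \right\}.$$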


Proposition 16. For any algorithm derived from the above schema, $\mathrm{Reg}_n \le \mathrm{Rel}_n(\emptyset)$.

The proof of this Proposition follows very closely the proof in [19] (see also [20]), and we omit it. 


6.1.1 Square Loss 

To provide concrete examples of how the recipe can be used to derive algorithms, we now consider the square
loss setting, $\ell(\hat{y}, y) = (\hat{y} - y)^2$. In this case, we observe that in (22), the first term in the maximum decreases as $\hat{y}$
increases to $B$ and likewise the second term monotonically decreases as $\hat{y}$ decreases to $-B$. Hence, the solution to
(22) is given when both terms are equal (if this does not happen within the range $[-B, B]$ then we clip the prediction
to this range). In other words, for the case of square loss, if we have an admissible relaxation, then the prediction
based on this relaxation is simply given by

$$\hat{y}_t = \mathrm{Clip}\left( \frac{\mathrm{Rel}_n(x_{1:t}, (y_{1:t-1}, B)) - \mathrm{Rel}_n(x_{1:t}, (y_{1:t-1}, -B))}{4B} \right)$$

where $\mathrm{Clip}(z) = B\,\mathbf{1}\{z > B\} + (-B)\,\mathbf{1}\{z < -B\} + z\,\mathbf{1}\{z \in [-B, B]\}$. Hence, for any admissible relaxation such that
$(\hat{y} - y_t)^2 + \mathrm{Rel}_n(x_{1:t}, (y_{1:t-1}, y_t))$ is a convex function of $y_t$, the above prediction based on the relaxation enjoys the
bound $\mathrm{Rel}_n(\emptyset)$ on regret (Proposition 16). Based on the above observations and the recipe outline, we now provide two examples
for which algorithms are derived.
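
The square-loss prediction rule can be sketched in a few lines. This is an illustration rather than the authors' code: the relaxation is passed in as a hypothetical callable `rel(xs, ys)`, and the helper names are assumptions. Equating $(\hat{y} - B)^2 + \mathrm{Rel}_n(\cdot, B)$ with $(\hat{y} + B)^2 + \mathrm{Rel}_n(\cdot, -B)$ and solving for $\hat{y}$ gives the ratio with denominator $4B$ used below.

```python
from typing import Callable, Sequence


def clip(z: float, B: float) -> float:
    """Project z onto the interval [-B, B]."""
    return max(-B, min(B, z))


def square_loss_prediction(
    rel: Callable[[Sequence, Sequence[float]], float],
    xs: Sequence,
    ys_past: Sequence[float],
    B: float,
) -> float:
    """Prediction that equalizes the two branches of the max for square loss.

    rel(xs, ys) is a user-supplied relaxation evaluated on x_{1:t} and a
    candidate outcome sequence y_{1:t}; it is assumed to be admissible.
    """
    rel_plus = rel(xs, list(ys_past) + [B])    # outcome y_t = +B
    rel_minus = rel(xs, list(ys_past) + [-B])  # outcome y_t = -B
    return clip((rel_plus - rel_minus) / (4.0 * B), B)
```

Only two evaluations of the relaxation are needed per round, which is what makes the recipe computationally attractive when the relaxation itself is cheap to evaluate.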


Example: Finite class of experts 

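As an example of an estimator derived from the schema for the square loss setting, we first consider the
simple case $|\mathcal{F}| < \infty$.

Corollary 17. The following is an admissible relaxation:

$$\mathrm{Rel}_n(x_{1:t}, y_{1:t}) = B^2 \log\left( \sum_{f\in\mathcal{F}} \exp\left( -B^{-2} \sum_{j=1}^{t} (f(x_j) - y_j)^2 \right) \right).$$

It leads to the algorithm

$$\hat{y}_t = \mathrm{Clip}\left( \frac{B}{4} \log \frac{\sum_{f\in\mathcal{F}} \exp\left( -B^{-2} \sum_{j=1}^{t-1} (f(x_j) - y_j)^2 - B^{-2} (f(x_t) - B)^2 \right)}{\sum_{f\in\mathcal{F}} \exp\left( -B^{-2} \sum_{j=1}^{t-1} (f(x_j) - y_j)^2 - B^{-2} (f(x_t) + B)^2 \right)} \right)$$

which enjoys a regret bound $\mathrm{Reg}_n \le B^2 \log|\mathcal{F}|$.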
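
A minimal sketch of the finite-experts forecaster follows. It assumes the relaxation has the log-sum-exp form $B^2 \log \sum_{f} \exp(-B^{-2} \sum_j (f(x_j) - y_j)^2)$, which is my reading of the corollary, and plugs it into the clipped difference rule with denominator $4B$. The function name and argument layout are hypothetical.

```python
import math
from typing import Sequence


def finite_experts_prediction(
    past_sq_losses: Sequence[float],
    expert_preds: Sequence[float],
    B: float,
) -> float:
    """Clipped prediction for a finite class under square loss.

    past_sq_losses[i] is sum_{j<t} (f_i(x_j) - y_j)^2 for expert i, and
    expert_preds[i] is f_i(x_t). The relaxation form is an assumption.
    """
    def log_sum_exp(vals):
        # Numerically stable log of a sum of exponentials.
        m = max(vals)
        return m + math.log(sum(math.exp(v - m) for v in vals))

    pairs = list(zip(past_sq_losses, expert_preds))
    plus = [-(L + (f - B) ** 2) / B ** 2 for L, f in pairs]   # branch y_t = +B
    minus = [-(L + (f + B) ** 2) / B ** 2 for L, f in pairs]  # branch y_t = -B
    z = (B / 4.0) * (log_sum_exp(plus) - log_sum_exp(minus))
    return max(-B, min(B, z))
```

As a sanity check on the formula, with a single expert the two log-sums differ by $4 f(x_t)/B$, so the forecaster simply repeats that expert's prediction.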

Example: Linear regression

Next, consider the problem of online linear regression in $\mathbb{R}^d$. Here $\mathcal{F}$ is the class of linear functions. For this
problem we consider a slightly modified notion of regret,

$$\sum_{t=1}^n (\hat{y}_t - y_t)^2 - \inf_{f\in\mathcal{F}} \left\{ \sum_{t=1}^n (f(x_t) - y_t)^2 + \lambda \|f\|_2^2 \right\}.$$

This regret can be seen alternatively as regret if we assume that on rounds $-d+1$ to $0$ Nature plays $(\lambda e_1, 0), \ldots, (\lambda e_d, 0)$,
where $\{e_i\}$ are the standard basis vectors, and that on these rounds the learner (knowing this) predicts 0,
thus incurring zero loss over these initial rounds. We can readily apply the schema for designing an algorithm for
this problem.


Corollary 18. For any $\lambda > 0$, the following is an admissible relaxation:


Rel„ [xi-t.yi-.t) = 


Lyj^ 

i=i 






+ 4B log 


-^yr 

J=1 


It leads to the Vovk-Azoury-Warmuth forecaster [29, 4]

$$\hat{y}_t = \mathrm{Clip}\left( x_t^\top \left( \sum_{j=1}^{t} x_j x_j^\top + \lambda I \right)^{-1} \left( \sum_{j=1}^{t-1} y_j x_j \right) \right)$$


and enjoys the regret bound 


-iiyt-ytf< 

n t=i 


1 

n 




t=l 



n 


The proofs of Corollaries 17 and 18 already appeared in [23], and we omit them here. 
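
For illustration, the clipped Vovk-Azoury-Warmuth prediction can be implemented incrementally. The sketch below is not taken from the paper: it assumes the standard form $\hat{y}_t = \mathrm{Clip}(x_t^\top (\sum_{j \le t} x_j x_j^\top + \lambda I)^{-1} \sum_{j < t} y_j x_j)$ and maintains the inverse matrix with the Sherman-Morrison rank-one update. The class and method names are hypothetical.

```python
from typing import List


class VAWForecaster:
    """Clipped Vovk-Azoury-Warmuth forecaster (illustrative sketch).

    Call predict(x_t) exactly once per round, then update(x_t, y_t).
    """

    def __init__(self, d: int, lam: float, B: float) -> None:
        self.d = d
        self.B = B
        # Inverse of (lam * I + sum_j x_j x_j^T), maintained incrementally.
        self.A_inv = [[(1.0 / lam if i == k else 0.0) for k in range(d)]
                      for i in range(d)]
        # Running sum of y_j * x_j over completed rounds.
        self.b = [0.0] * d

    def predict(self, x: List[float]) -> float:
        d = self.d
        # Sherman-Morrison: fold the current x_t into the inverse first.
        Ax = [sum(self.A_inv[i][k] * x[k] for k in range(d)) for i in range(d)]
        denom = 1.0 + sum(x[i] * Ax[i] for i in range(d))
        self.A_inv = [[self.A_inv[i][k] - Ax[i] * Ax[k] / denom
                       for k in range(d)] for i in range(d)]
        w = [sum(self.A_inv[i][k] * self.b[k] for k in range(d))
             for i in range(d)]
        z = sum(w[i] * x[i] for i in range(d))
        return max(-self.B, min(self.B, z))

    def update(self, x: List[float], y: float) -> None:
        # Record the revealed outcome for use in later rounds.
        self.b = [bi + y * xi for bi, xi in zip(self.b, x)]
```

The rank-one update keeps the per-round cost at $O(d^2)$ instead of re-inverting a $d \times d$ matrix each round. Note that `predict` mutates the stored inverse, so it must not be called twice for the same round.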


7 Discussion and Related Work 

In the past twenty years, progress in online regression for arbitrary sequences, starting with the paper of [10], has 
been almost exclusively on finite-dimensional linear regression (an incomplete list includes [29, 15, 17, 30, 6, 3, 
4, 16, 11]). This is to be contrasted with Statistics, where regression has been studied for rich (nonparametric) 
classes of functions. Important exceptions to this limitation in the online regression framework - and works that 
partly motivated the present findings - are the papers of [33, 31, 32]. Vovk considers regression with large classes, 
such as subsets of a Besov or Sobolev space, and remarks that there appear to be two distinct approaches to
obtaining the upper bounds in online competitive regression. The first approach, which Vovk terms Defensive 
Forecasting, exploits uniform convexity of the space, while the second - an aggregating technique (such as the 
Exponential Weights Algorithm) - is based on the metric entropy of the space. Interestingly, the two seemingly 
different approaches yield distinct upper bounds, based on the respective properties of the space. In particular, 
Vovk asks whether there is a unified view of these techniques. The present paper addresses these questions and 
establishes optimal performance for online regression. 

Since most work in online learning is algorithmic, the boundaries of what can be proved are defined by the 
regret minimization algorithms one can find. One of the main algorithmic workhorses is the aggregating procedure 
(or, the Exponential Weights Algorithm). However, the difficulty in using an aggregating procedure beyond simple 
parametric classes (e.g. subsets of $\mathbb{R}^d$) lies in the need for a “pointwise” cover of the set of functions - that is, a
data-independent cover in the supremum norm on the underlying space of covariates. The same difficulty arises 
when one uses PAC-Bayesian bounds [2] that, at the end of the day, require a volumetric argument. Notably, this 
difficulty has been overcome in statistical learning, where it has long been recognized (since the work of Vapnik and 
Chervonenkis) that it is sufficient to consider an empirical cover of the class - a potentially much smaller quantity. 
Such an empirical entropy is necessarily finite, and its growth with n is one of the key complexity measures for 
i.i.d. learning. In particular, the recent work of [27] shows that the behavior of empirical entropy characterizes the
optimal rates for i.i.d. learning with square loss. To mimic this development, it appears that we need to understand 
empirical covering numbers in the sequential prediction framework. 

A hint as to how to modify the analysis of [24] for “curved” losses appears in the paper of [8] where the authors
derived rates for log-loss via a two-level procedure: the set of densities is first partitioned into small balls of a
critical radius $\gamma$; a minimax algorithm is employed on each of these small balls; and an overarching aggregating
procedure combines these algorithms. Regret within each small ball is upper bounded by classical Dudley entropy
integral (with respect to a pointwise metric) defined up to the radius $\gamma$. The main technical difficulty in this paper
is to prove a similar statement using “empirical” sequential covering numbers. 

Interestingly, our results imply the same phase transition as the one exhibited in [26] for i.i.d. learning with
square loss. More precisely, under the assumption of the $O(\beta^{-p})$ behavior of sequential entropy, the minimax
regret normalized by time horizon $n$ decays as $n^{-2/(2+p)}$ if $p \in (0, 2]$, and as $n^{-1/p}$ for $p \ge 2$. We prove lower bounds
that match up to a logarithmic factor, establishing that the phase transition is real. Even more surprisingly, it
follows that, under a mild assumption that sequential Rademacher complexity of $\mathcal{F}$ behaves similarly to its i.i.d.
cousin, the rates of minimax regret in online regression with arbitrary sequences match, up to a logarithmic factor, 
those in the i.i.d. setting of Statistical Learning. This phenomenon has been noticed for some parametric classes 
by various authors (e.g. [7]). The phenomenon is even more striking given the simple fact that one may convert 
the regret statement, that holds for all sequences, into an i.i.d. guarantee. Thus, in particular, we recover the 
result of [27] through completely different techniques. Since in many situations, one obtains optimal rates for 
i.i.d. learning from a regret statement, the relaxation framework of [19] provides a toolkit for developing improper 
learning algorithms in the i.i.d. scenario. 
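
The phase transition discussed in this section can be summarized as a piecewise exponent. The helper below is a sketch under the assumption that the normalized minimax regret for square loss scales as $n^{-2/(2+p)}$ for $p \in (0, 2]$ and as $n^{-1/p}$ for $p \ge 2$, with logarithmic factors ignored. The function name is hypothetical.

```python
def regret_rate_exponent(p: float) -> float:
    """Exponent a such that normalized minimax regret scales as n**a.

    Assumes sequential entropy grows as beta**(-p); logarithmic factors
    are ignored. The two branches agree at the threshold p = 2.
    """
    if p <= 0:
        raise ValueError("p must be positive")
    if p <= 2.0:
        return -2.0 / (2.0 + p)  # curvature of the loss helps in this regime
    return -1.0 / p              # curvature no longer affects the rate
```

Both branches give $-1/2$ at $p = 2$, so the rate is continuous across the threshold even though its functional form changes.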


A Proofs 


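Proof of Lemma 3. Denoting the set of distributions on $\mathcal{Y}$ by $\mathcal{P}$, minimax regret can be written as

$$
\begin{aligned}
V_n &= \Big\langle\!\!\Big\langle \sup_{x_t} \inf_{q_t} \sup_{p_t\in\mathcal{P}} \mathbb{E}_{\hat{y}_t\sim q_t,\ y_t\sim p_t} \Big\rangle\!\!\Big\rangle_{t=1}^n \left[ \sum_{t=1}^n \ell(\hat{y}_t, y_t) - \inf_{f\in\mathcal{F}} \sum_{t=1}^n \ell(f(x_t), y_t) \right] \\
&= \Big\langle\!\!\Big\langle \sup_{x_t} \sup_{p_t\in\mathcal{P}} \mathbb{E}_{y_t} \Big\rangle\!\!\Big\rangle_{t=1}^n \left[ \sum_{t=1}^n \inf_{\hat{y}_t} \mathbb{E}_{y_t}\left[\ell(\hat{y}_t, y_t)\right] - \inf_{f\in\mathcal{F}} \sum_{t=1}^n \ell(f(x_t), y_t) \right] \\
&= \Big\langle\!\!\Big\langle \sup_{x_t} \sup_{p_t\in\mathcal{P}} \mathbb{E}_{y_t} \Big\rangle\!\!\Big\rangle_{t=1}^n \left[ \sup_{f\in\mathcal{F}} \left\{ \sum_{t=1}^n \inf_{\hat{y}_t} \mathbb{E}_{y_t}\left[\ell(\hat{y}_t, y_t)\right] - \sum_{t=1}^n \ell(f(x_t), y_t) \right\} \right]
\end{aligned}
\tag{23}
$$

where the first equality is by definition, the second follows from an argument of [1, 24], and the third is a simple
rearrangement. Taking $y_t^* = \operatorname*{argmin}_{\hat{y}\in\mathcal{Y}} \mathbb{E}_{y\sim p_t}\left[\ell(\hat{y}, y)\right]$, we write the above as

$$\Big\langle\!\!\Big\langle \sup_{x_t} \sup_{p_t\in\mathcal{P}} \mathbb{E}_{y_t} \Big\rangle\!\!\Big\rangle_{t=1}^n \left[ \sup_{f\in\mathcal{F}} \left\{ \sum_{t=1}^n \mathbb{E}_{y_t}\left[\ell(y_t^*, y_t)\right] - \sum_{t=1}^n \ell(f(x_t), y_t) \right\} \right] \tag{24}$$

$$= \Big\langle\!\!\Big\langle \sup_{x_t} \sup_{p_t\in\mathcal{P}} \mathbb{E}_{y_t} \Big\rangle\!\!\Big\rangle_{t=1}^n \left[ \sup_{f\in\mathcal{F}} \left\{ \sum_{t=1}^n \ell(y_t^*, y_t) - \ell(f(x_t), y_t) \right\} \right]. \tag{25}$$

The last step holds true by observing that the terms $\ell(y_t^*, y_t)$ do not depend on $f$ and can therefore be moved
outside the supremum over $f$. The equality then follows by the linearity of expectation. By definition of $\Delta$ in
(2),

$$V_n \le \Big\langle\!\!\Big\langle \sup_{x_t} \sup_{p_t\in\mathcal{P}} \mathbb{E}_{y_t} \Big\rangle\!\!\Big\rangle_{t=1}^n \left[ \sup_{f\in\mathcal{F}} \sum_{t=1}^n \partial\ell(y_t^*, y_t) \cdot (y_t^* - f(x_t)) - \Delta(y_t^* - f(x_t)) \right]. \tag{26}$$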
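By definition of $y_t^*$, we have $\mathbb{E}_{y_t\sim p_t}\left[\partial\ell(y_t^*, y_t)\right] = \partial\,\mathbb{E}_{y_t\sim p_t}\left[\ell(y_t^*, y_t)\right] = 0$ by the assumption that the minimum is
attained in $\mathcal{Y}$ (see Section 2.1). Thus we can view $(\partial\ell(y_t^*, y_t))_{t=1}^n$ as a martingale difference sequence. Hence,

$$V_n \le \Big\langle\!\!\Big\langle \sup_{x_t} \sup_{p_t\in\mathcal{P}} \mathbb{E}_{y_t} \Big\rangle\!\!\Big\rangle_{t=1}^n \left[ \sup_{f\in\mathcal{F}} \sum_{t=1}^n \left( \partial\ell(y_t^*, y_t) - \mathbb{E}_{y_t'\sim p_t}\left[\partial\ell(y_t^*, y_t')\right] \right) \cdot (y_t^* - f(x_t)) - \Delta(y_t^* - f(x_t)) \right].$$

By Jensen’s inequality the above is upper bounded by

$$
\begin{aligned}
&\Big\langle\!\!\Big\langle \sup_{x_t} \sup_{p_t\in\mathcal{P}} \mathbb{E}_{y_t, y_t'} \Big\rangle\!\!\Big\rangle_{t=1}^n \left[ \sup_{f\in\mathcal{F}} \sum_{t=1}^n \left( \partial\ell(y_t^*, y_t) - \partial\ell(y_t^*, y_t') \right) \cdot (y_t^* - f(x_t)) - \Delta(y_t^* - f(x_t)) \right] \\
&\quad = \Big\langle\!\!\Big\langle \sup_{x_t} \sup_{p_t\in\mathcal{P}} \mathbb{E}_{y_t, y_t'} \mathbb{E}_{\epsilon_t} \Big\rangle\!\!\Big\rangle_{t=1}^n \left[ \sup_{f\in\mathcal{F}} \sum_{t=1}^n \epsilon_t \left( \partial\ell(y_t^*, y_t) - \partial\ell(y_t^*, y_t') \right) \cdot (y_t^* - f(x_t)) - \Delta(y_t^* - f(x_t)) \right]
\end{aligned}
$$

where we introduced Rademacher random variables. The next step involves splitting the upper bound into two
equal terms, one for the $y_t$ sequence and the other for the $y_t'$ sequence:

$$V_n \le \Big\langle\!\!\Big\langle \sup_{x_t} \sup_{p_t\in\mathcal{P}} \mathbb{E}_{y_t\sim p_t} \mathbb{E}_{\epsilon_t} \Big\rangle\!\!\Big\rangle_{t=1}^n \left[ \sup_{f\in\mathcal{F}} \sum_{t=1}^n 2\epsilon_t\, \partial\ell(y_t^*, y_t) \cdot (y_t^* - f(x_t)) - \Delta(y_t^* - f(x_t)) \right].$$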


Using Jensen’s inequality once again leads to an upper bound of 


sup sup Ey,~p,Ee, 

Xt 


t=l 


sup IE {y *, yt) ■ (y* - fixt)] - A(y; - f{xt)) I 
lt=l J 


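Now, observe that $y_t^*$ is a function of $p_t$ and $\partial\ell(y_t^*, y_t)$ is a function of $y_t$ and $p_t$. Hence, we may pass to a further
upper bound by inserting suprema and by replacing each subgradient with the respective $\eta_t$ and each $y_t^*$ with $\mu_t$:

$$
\begin{aligned}
&\Big\langle\!\!\Big\langle \sup_{x_t} \sup_{p_t\in\mathcal{P}} \mathbb{E}_{y_t\sim p_t} \sup_{\mu_t} \sup_{\eta_t\in[-G,G]} \mathbb{E}_{\epsilon_t} \Big\rangle\!\!\Big\rangle_{t=1}^n \left[ \sup_{f\in\mathcal{F}} \sum_{t=1}^n 2\epsilon_t \eta_t \cdot (\mu_t - f(x_t)) - \Delta(\mu_t - f(x_t)) \right] \\
&\quad = \Big\langle\!\!\Big\langle \sup_{x_t} \sup_{\mu_t} \sup_{\eta_t\in[-G,G]} \mathbb{E}_{\epsilon_t} \Big\rangle\!\!\Big\rangle_{t=1}^n \left[ \sup_{f\in\mathcal{F}} \sum_{t=1}^n 2\epsilon_t \eta_t \cdot (\mu_t - f(x_t)) - \Delta(\mu_t - f(x_t)) \right].
\end{aligned}
$$

Since each $\eta_t$ ranges over $[-G, G]$, we can represent it as $G$ times the expectation of a random variable $u_t \in \{-1, 1\}$.
Denoting this distribution by $q_t$, by Jensen’s inequality

$$
\begin{aligned}
V_n &\le \Big\langle\!\!\Big\langle \sup_{x_t, \mu_t, q_t} \mathbb{E}_{\epsilon_t} \Big\rangle\!\!\Big\rangle_{t=1}^n \left[ \sup_{f\in\mathcal{F}} \sum_{t=1}^n 2G\epsilon_t\, \mathbb{E}[u_t]\, (\mu_t - f(x_t)) - \Delta(\mu_t - f(x_t)) \right] \\
&\le \Big\langle\!\!\Big\langle \sup_{x_t, \mu_t, q_t} \mathbb{E}_{u_t\sim q_t} \mathbb{E}_{\epsilon_t} \Big\rangle\!\!\Big\rangle_{t=1}^n \left[ \sup_{f\in\mathcal{F}} \sum_{t=1}^n 2G\epsilon_t u_t (f(x_t) - \mu_t) - \Delta(\mu_t - f(x_t)) \right] \\
&= \Big\langle\!\!\Big\langle \sup_{x_t, \mu_t} \max_{u_t\in\{-1,1\}} \mathbb{E}_{\epsilon_t} \Big\rangle\!\!\Big\rangle_{t=1}^n \left[ \sup_{f\in\mathcal{F}} \sum_{t=1}^n 2G\epsilon_t u_t (f(x_t) - \mu_t) - \Delta(\mu_t - f(x_t)) \right].
\end{aligned}
$$

Since for any fixed $u_t \in \{+1, -1\}$, the distribution of $\epsilon_t u_t$ is the same as that of $\epsilon_t$, the above expression is simply

$$
\begin{aligned}
&\Big\langle\!\!\Big\langle \sup_{x_t, \mu_t} \mathbb{E}_{\epsilon_t} \Big\rangle\!\!\Big\rangle_{t=1}^n \left[ \sup_{f\in\mathcal{F}} \sum_{t=1}^n 2G\epsilon_t (f(x_t) - \mu_t) - \Delta(\mu_t - f(x_t)) \right] \\
&\quad = \sup_{\mathbf{x}, \boldsymbol{\mu}} \mathbb{E}_\epsilon \left[ \sup_{f\in\mathcal{F}} \sum_{t=1}^n 2G\epsilon_t \big( f(\mathbf{x}_t(\epsilon)) - \boldsymbol{\mu}_t(\epsilon) \big) - \Delta\big( \boldsymbol{\mu}_t(\epsilon) - f(\mathbf{x}_t(\epsilon)) \big) \right]
\end{aligned}
$$

which is the same as the desired upper bound in (8), in the tree notation. □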
Proof of Lemma 4. It holds that




maxi y 2CefWf (£■) - A (wdc)) 

,.r,-T/tA 


t=l 




inf-log 
A>0^ 




^ exp A ^2Ce'fWt(e)-A(Wf(c)) 


f=i 


which, by Jensen’s inequality, is upper bounded by 


inf ^ i log 
A>0 


1 


A>0 




y Eg exp A y2CetWt(e)-A(wt(£-)) 


f=i 


inf < — log 


EE. n exp (A (2CetWf (c) - A (Wf (c)))) 

VwetV U=1 


Since $e^{a} + e^{-a} \le 2e^{a^2/2}$, we have that

Eg^ [exp(A(2Ce„w„(e)-A(w„(e))))] < exp(2C^A^w„(e)^-AA(w„(e))) 

= exp (2C^A^w„(e)^ - AF (w„(£-)^)) 

By definition of conjugacy, we pass to a further upper bound of 

exp (Ar* (2C^A)) < exp (AF* (2C^A)) 

where the last step is because F* is non-decreasing. Hence we have that 




si log 


Yl exp (A {2C€tWt{e) - A {wtie))]) 

t=i 

' n-l 

Eei.„_i n exp(A(2CefWt(c)-A(Wf(c)))) 
I ■ [ t=i 


-tF*(2C^A) 


Proceeding in similar fashion from n-l down to 1, we arrive at an upper bound of 

ilog|Mi| + nF*(2C2A) 

A 

This proves the first claim. The second statement (which already appears in [24]) is proved similarly, except the
tuning value $\lambda$ is chosen at the end, and we need to account for the worst-case $\ell_2$ norm along any path. We
provide the proof here for completeness. For any tree $\mathbf{w} \in W$,


E 


exp-j y ACfWfCe) 

t=i 


£1:72-1 


n-l 

< exp-j y ACfWfCe) ^exp{A^w„(e)2/2} 

n-l 


t=l 


<exp< ^ XctWtie) /■maxexp{A^Wrt(e)^/2} 




Continuing in this fashion backwards to $t = 1$, for any $\mathbf{w} \in W$


exp^ ^ AcfWKc) 


t=i 


< max exp 

ei,...,£n 


\a^i2)t 

i f=i 


w„(e)" 


and thus 


y exp] y ACfWfCe) 
weW U=1 


< |W| max maxexp] (A^/2) y w„(e)^l- . 


weW ( 


f=l 
Choosing


we obtain 


A = 


21 og|tV| 


w Z w„ (c)2 


Emax 

weW 


E etWf(e) 

f=i 


< -logE 


^ expj ^ AetWf(e) 
weVl^ it=l 


<, 2\og\W\- max ^w„(c)2 


□ 
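
The second claim of Lemma 4 can be sanity-checked by brute force in a simplified setting. The sketch below is not from the paper: it replaces trees with fixed vectors (so $\mathbf{w}_t(\epsilon)$ does not depend on $\epsilon$) and compares the exact value of $\mathbb{E} \max_{\mathbf{w}} \sum_t \epsilon_t w_t$ with the bound $\sqrt{2 \log|W| \max_{\mathbf{w}} \sum_t w_t^2}$, which is how I read the final display of the proof. Function names are hypothetical.

```python
import itertools
import math
from typing import Sequence


def expected_max(W: Sequence[Sequence[float]]) -> float:
    """Exact E_eps max_{w in W} sum_t eps_t * w_t for fixed vectors w."""
    n = len(W[0])
    total = 0.0
    # Average over all 2**n Rademacher sign patterns.
    for eps in itertools.product((-1.0, 1.0), repeat=n):
        total += max(sum(e * w_t for e, w_t in zip(eps, w)) for w in W)
    return total / 2 ** n


def finite_class_bound(W: Sequence[Sequence[float]]) -> float:
    """Bound sqrt(2 * log|W| * max_w sum_t w_t^2) for the same quantity."""
    worst_sq_norm = max(sum(w_t ** 2 for w_t in w) for w in W)
    return math.sqrt(2.0 * math.log(len(W)) * worst_sq_norm)
```

For example, with $W = \{(1,1,1), (-1,-1,-1)\}$ the exact value is $\mathbb{E}|\epsilon_1 + \epsilon_2 + \epsilon_3| = 1.5$, which sits below the bound $\sqrt{6 \log 2}$.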


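Proof of Lemma 5. Fix $\gamma > 0$. Let $V'$ be a sequential $\gamma/2$-cover of $\mathcal{G}$ on $\mathbf{z}$ in the $\ell_\infty$ sense, i.e.

$$\forall \epsilon,\ \forall g \in \mathcal{G},\ \exists \mathbf{v}' \in V' \ \text{ s.t. } \ \max_{t\in[n]} |g(\mathbf{z}_t(\epsilon)) - \mathbf{v}'_t(\epsilon)| \le \gamma/2.$$

We now modify $V'$ to construct a $\gamma$-cover of $\mathcal{G}$ on $\mathbf{z}$, which we shall denote by $V$. The $\gamma$-cover is built as follows. For
every $\mathbf{v}' \in V'$ we include $\mathbf{v}$ in $V$ as defined by a soft-thresholding operation:

$$\forall \epsilon \in \{\pm 1\}^n,\ \forall t \in [n], \qquad \mathbf{v}_t(\epsilon) = \begin{cases} 0 & \text{if } |\mathbf{v}'_t(\epsilon)| \le \gamma/2 \\ \operatorname{sign}(\mathbf{v}'_t(\epsilon)) \left( |\mathbf{v}'_t(\epsilon)| - \gamma/2 \right) & \text{otherwise.} \end{cases}$$

Since we change each $\mathbf{v}' \in V'$ only by $\gamma/2$ on each coordinate, $V$ is indeed a $\gamma$-cover in the $\ell_\infty$ sense. Also note that
by the way we constructed $V$ from $V'$, we also have that for every $\epsilon$ and any $g \in \mathcal{G}$ there exists a $\mathbf{v} \in V$ that is $\gamma$-close
in the $\ell_\infty$ sense and for this $\mathbf{v}$, $|g(\mathbf{z}_t(\epsilon))| \ge |\mathbf{v}_t(\epsilon)|$ for every $t$. Hence, $\Delta(g(\mathbf{z}_t(\epsilon))) \ge \Delta(\mathbf{v}_t(\epsilon))$ by the assumption
that $\Delta$ is nondecreasing on $\mathbb{R}_{\ge 0}$ and non-increasing on $\mathbb{R}_{\le 0}$. Denote such a $\gamma$-close tree $\mathbf{v}$ by $\mathbf{v}[\epsilon, g]$ to make the
dependence on $g, \epsilon$ explicit. Since, for all $\epsilon$ and all $t$, $\Delta(g(\mathbf{z}_t(\epsilon))) \ge \Delta(\mathbf{v}[\epsilon, g]_t(\epsilon))$, we have

$$\mathbb{E} \sup_{g\in\mathcal{G}} \left[ \sum_{t=1}^n 2C\epsilon_t g(\mathbf{z}_t(\epsilon)) - \Delta(g(\mathbf{z}_t(\epsilon))) \right] \tag{27}$$

$$
\begin{aligned}
&= \mathbb{E} \sup_{g\in\mathcal{G}} \left[ \sum_{t=1}^n 2C\epsilon_t \big( g(\mathbf{z}_t(\epsilon)) - \mathbf{v}[\epsilon, g]_t(\epsilon) \big) + 2C\epsilon_t \mathbf{v}[\epsilon, g]_t(\epsilon) - \Delta(g(\mathbf{z}_t(\epsilon))) \right] \\
&\le \mathbb{E} \sup_{g\in\mathcal{G}} \left[ \sum_{t=1}^n 2C\epsilon_t \big( g(\mathbf{z}_t(\epsilon)) - \mathbf{v}[\epsilon, g]_t(\epsilon) \big) + 2C\epsilon_t \mathbf{v}[\epsilon, g]_t(\epsilon) - \Delta(\mathbf{v}[\epsilon, g]_t(\epsilon)) \right].
\end{aligned}
$$

Since $\mathbf{v}[\epsilon, g]$ ranges over $V$, the last expression is upper bounded by

$$\mathbb{E} \sup_{g\in\mathcal{G}} \left[ \sum_{t=1}^n 2C\epsilon_t \big( g(\mathbf{z}_t(\epsilon)) - \mathbf{v}[\epsilon, g]_t(\epsilon) \big) \right] + \mathbb{E} \max_{\mathbf{v}\in V} \left[ \sum_{t=1}^n 2C\epsilon_t \mathbf{v}_t(\epsilon) - \Delta(\mathbf{v}_t(\epsilon)) \right]. \tag{28}$$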


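Now let $\mathbf{v}[\epsilon, g]$ be denoted by $\mathbf{v}^0[\epsilon, g]$ and $V$ be denoted by $V^0$. Let $\beta_j = 2^{-j}\gamma$ and let $V^j$ denote a sequential $\beta_j$-cover
of $\mathcal{G}$ on the tree $\mathbf{z}$, for $j = 1, \ldots, N$, with $N \ge 1$ to be specified later. We can now write the first term in (28) as the
constant $C$ times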


Esup 

ge.'g 


n N 


^2ct(g(Zf(e))-v^[e,g]f(c)j-t^ ^ 2ef[v-'’[e,g]f(c)-v^ ^[c,g]f(e)] 

t=l t=lj=l 


t=lj=l 
N 


J^2et{g{zt{e])-v^[e,g]t(e)] + Esup J^2et{yHe.g]tie)-v-> ^[e,g]f(e:)] 
‘ ^ j=i ge^lt=i ^ ^ 


t=i 

rN 


< Esup 

ge^S 

Observe that |g(zt(c)) - v™ [e,g]f (c)| < 2/3jv. and hence the first term above is upper-bounded by 4/3]vn. 
We now upper bound the second term. Fix $\rho \in (0, \gamma)$ and choose $N = \max\{j : \beta_j > 2\rho\}$. Then $\beta_{N+1} \le 2\rho$ and
$\beta_N \le 4\rho$. Further, $\beta_{N+1} > \rho$. Then the second term is upper bounded via Lemma 4 by

£ 3Pj j2n\og(\Vj\\Vj-^\] < 12^/n Jlog^^[S,^,z)dS 
j=i Jp 

and, finally, the second term in (28) is upper bounded via Lemma 4 by

inf |ilog,y1/»(|,i#,z) + nT* [2C^X] 

A>o ( 


Combining the results. 


Esup 

ge?? 


n 

Y, 2Cetg(zt(e)) - A (g(zt(e))) 


f=i 


<C inf 

pe(0,r) 


^4pn + 12\/7ij ,7)d8 


+ inf-{ ilog^yl/, 
A>0 


fl 
, ( 2 ’ 


?,z) + nr*(2C2A)| 


Since $\gamma$ was chosen arbitrarily, the result follows.


□ 
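
The soft-thresholding step in the proof of Lemma 5 has two properties that the argument relies on: the thresholded value stays within $\gamma$ of the function value, and its magnitude never exceeds that of the function value. The sketch below illustrates the operation on scalars. It is an illustration with a hypothetical function name, applied coordinate-wise rather than to trees.

```python
import math


def soft_threshold(v_prime: float, gamma: float) -> float:
    """Shrink v_prime toward zero by gamma/2, mapping small values to 0."""
    if abs(v_prime) <= gamma / 2.0:
        return 0.0
    # Keep the sign of v_prime and reduce its magnitude by gamma/2.
    return math.copysign(abs(v_prime) - gamma / 2.0, v_prime)
```

If $|g - v'| \le \gamma/2$, then $|v| = \max(|v'| - \gamma/2, 0) \le |g|$ and $|g - v| \le |g - v'| + |v' - v| \le \gamma$, which is exactly what is needed to compare $\Delta(g)$ with $\Delta(v)$ when $\Delta$ is monotone on each half-line.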


Proof of Lemma 6. Recall that by definition, 

- £{fixt),yt) = d€{y*,yt) ■ (y* - /(Xf)) - 

From Eq. (24) in the proof of Lemma 3, 


V„ = ( sup sup Ey, 

\\ X, pi^gf 


sup sup Eyj 

xt 


n 

t=l 


n 

t=l 


sup I Y ’ yt'i - , yd I 

/e.^ lf=l J 

IE - yd • (y* - fixt)) - } 


The above inequality holds true if $y_t^*$ ensures $\mathbb{E}_{y\sim p_t}\left[\partial\ell(y_t^*, y)\right] = 0$. Let us now pass to a lower bound by restricting
the set of possible optima to be in $\mathcal{S}$ and the set of associated distributions to be two-point uniform distributions
on the corresponding $y_1(y_t^*), y_2(y_t^*)$. Recall that by definition


As(/(xd-5) >max{A^^,^^,,A^^(^^,} 


The lower bound is then 
Vn> 


'"’-i 


sup sup Ee, I 

Xt St£.S II 


I t=l 


1 _ 


sup ^ E ■ (^f “ - - As(/(xd - St) 


> R sup Eg 


f€^ lt=l 


1 - 


j E • (/(xt(f) - (UfM)) - -^^11,(0 (/(Xf (e)) - Mtle)) 


for any S-valued tree p. □ 

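Proof of Lemma 8. Fix a $\beta > 0$, and set $n = \mathrm{fat}_\beta(\mathcal{F}, \mathcal{X})$. Suppose $\mathbf{x}$ is an $\mathcal{X}$-valued tree of depth $n$ that is $\beta$-shattered
by $\mathcal{F}$:

$$\forall \epsilon,\ \exists f^\epsilon \in \mathcal{F} \ \text{ s.t. } \ \epsilon_t \big( f^\epsilon(\mathbf{x}_t(\epsilon)) - \boldsymbol{\mu}_t(\epsilon) \big) \ge \beta/2$$

where $\boldsymbol{\mu}$ is the witness to shattering. Then from (14) with the particular choices of $\mathbf{x}$ and $\boldsymbol{\mu}$ described above,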


Vn > Esup 
/eJ? 


> Esup 
f€S^ 


>E 


^ R€t(f{xt{€]) - /Xj(e:)) - A^^(e) (/(Xf(e)) - 
t=i 
n 

^ i?Cf (/(Xf (£■)) - /tf(e)) - -|/(Xf(e)) - /tf (C)l 
t=i ^ 


R 


^ 7?ef (/‘^(Xf(e)) - /if (£■)) - -|/'^(xt(c)) - /if(e)| 

t=i 


(29) 

(30) 

(31) 


Using the definition of shattering, we can further lower bound the above quantity by 

Rnp 


E 


E 7l/''(Xf(c))-/if(e)l 

f=i ^ 


Now, suppose $\mathrm{fat}_\beta(\mathcal{F}, \mathcal{X}) = 1/\beta^p$, $p > 0$. Then $n = \beta^{-p}$ implies $\beta = n^{-1/p}$. The result follows. □

Proof of Lemma 9. Assume that d = fat^g < n. Let z be an -valued tree of depth d that is ^-shattered by 
with a witness tree s. Observe that the functions f^ that guarantee 

VrE[n], et(f{zt(e))-St(e))>pi2 (32) 


do not necessarily take on values close to the Sf (c) +/3/2 interval. We augment with 2‘^ functions g'^ that take on 
the same values as except on points on the z tree where, for some choice k, we have, 

g‘^(Zf(e)) =et;6/2-i-x. 

Let ^ be the resulting class of functions, and - S'\ S'. We now argue that fat^gi,^) cannot be more than 2d + 4, 
as we have only added at most 2'^ functions to S'. Suppose for the sake of contradiction that there exists a tree 
z of depth at least 2d -i- 5 shattered by S. There must exist functions that shatter z and only at most 2'^ of 
them can be from Let us label the leaves of z with the functions that shatter the corresponding path from the 
root; these functions are clearly distinct. Order the leaves of the tree in any way, and observe that there must exist 
a pair of functions from ^ with indices differing by at least 2'^'*''^. It is easy to see that such two leaves can only 
have a common parent at d -i- 3 levels from the leaves, and this yields a complete binary subtree of size d -i-1 that is 
shattered by functions in a contradiction. 

We will now use the function class S to prove a lower bound. Recall that z is an .SK”-valued tree of depth fat^g that 
is ;6-shattered hy'^ ^S, with a witness tree having the constant k at every node. We will now show a construction 
of particular trees of depth 


n 


/ 



(33) 


using the tree z. Define k = f > 1 and consider the 3L -valued tree x and the R-valued tree /i of depth n' 

constructed as follows. For any path $\epsilon \in \{\pm 1\}^{n'}$ and any $t \in [n']$, set


Xf(e) =Zrii (e), 
' k ' 

where e e {+1}^®*^ is the sequence of signs specified as 


f 

sign 

( k 

-sign 

( 2it 

E 

,...,sign 

^ kfaX^ V 

E 


U=i 


\j=k+l 
We now lower bound (14) by choosing the particular $\mathbf{x}$, $\boldsymbol{\mu}$ defined above:


> RE sup 
/eJf 

= RE sup 
/eJf 


^ et(/(Xf(e)) - - - AK(/(xt(e)) - ?f) 

K 


t=l 


^ {€)] -K)- - Ak(/(Zp^^ (€]) - K] 


t=l 


Splitting the sum over t into fat^ blocks, the above expression is equal to 


Esup 

/e^ 


tat/i i-k 


1 


E E ej(/(Zi(^)-x)--A k(/(z;(c))-x) 
i=lj=(/-l)fc+l ^ 


= f? E sup 
/ejr 

= Jf E sup 
f€^ 


fats 


' i-k ' 


^(/(Z;(e))-X) ^ £j 

! = 1 Vt = (l-l)fc+l J 

(.A; 

^ ei(/(z,-(^)-x) ^ c 

1=1 ;=(i-l)fc+l 


-- Ak(/( z,•(£))-X) 
H 


-- AK(/(Zi(^)-x) 
K 


where the last step follows by the definition of e. Recall that z is shattered by the subset and that the functions in 
stay close to the witness tree s. We obtain a lower bound 


R Esup 


fat« 


i=\ 


i-k 

E ‘ 

j=(/-i)fc+i 


-- K) 
K 


fatfl 


>7fE^ 
! = 1 


> R 


i-k 

E ‘ 

j=(i-i)fc+i 


p k_k (p 

2 V 2 f? [ 2 


k 

--Ak 


R 

UJj 


where we used Khinchine’s inequality in the last step. By the definition of k, 

and fat^(,^)fc = n' and so we conclude that, 


V„>> 


Rp 




Since we are free to choose k and p, 


V„i > sup 

P,K 


RP I n'iaX^i^) , 

~V 2 ” 



(34) 


Examining (23), we see that Vn is nondecreasing with n. To see this, let n' > n. For f e {« + 1,..., n'], we may choose 
Pt in (23) as a delta distribution on f*{xt), for any sequence of Xf, where /* is an optimal function over steps 
Clearly, V„/ > In view of (33) and the above discussion, V^' < Vzn-i > and thus 


^212 5 1^20-1 S Vn '- 


□ 
Proof of Theorem 10. We have Kf for r > 2. It will suffice to take the conjugate oft^K over all of K. A

straightforward calculation shows that 


, K I2s]t 


2 r-2 Sr-2 

< - 


2e 


Combining Lemma 3 and Lemma 5, the minimax value $V_n$ is upper bounded by

inf l4pn+12\/7if ^log.yl^(5,^,+ inf i —log^yKxj n] + nT* (2AG^) 
E(o,r) ( Jp ^ ) A>o (^ 


inf < G 

r>o [ pe(o,r) 

Consider the case p e (0,2). By the assumption on the growth of the covering numbers, and taking p -II n, 


f , n)d8 < f S J\og{nl6)d6 < ^ 

Jp ' Jp * 2—p 

Then (35) is upper bounded by 

I 4G+ Gsfn-^ ^ + inf i —y^^logCn/y) + nT* (2AG^)l 

2 - P A>o I J 


inf] 2 

r>o [ 


We take Y>n‘^ and divide through by log n: 


inf ]c„Gn-iV'’'' + inf]—r-P + r(2AG2)ll 
nlogn n y>n-c ( a>o I J ) 


where Cp is a constant that depends on p. Balancing the terms in the inner infimum, 


inf 

A>0 


1^7 P + r*(2AG2)l= infl^r P + ' 

(nX j ;i>o [ nX ^7^2 


r rp r _ 1 

<Crn 2 (r-l)y 2l.r-l)Qr-lK r-l 


(35) 


where Cr is a constant that depends on r and may change from one expression to next. The value of y that balances 

- CrU~2(r-l) Y~2lr-1) Q — 

^_^ 2 

Y=Cr,pn 2ir-l)+p Q2[r-l)+p K 2(r-l)+p 

and it gives 


Vn ( _1_ _2__2_ 

- — <Crp [n ^f-l)+p Q2lr-1HP K 2(''-l)+p 

n\ogn I ) 


l-p/2 


Gn 


- 1/2 


2-p 


= Cr,pn 2 ^r-lHp Q 2 {r-l)^p ^ 2(r-l)+p 


On the other hand, using [25, Theorem 8], we have 

Vn^cGTfXnim 

In turn, sequential Rademacher complexity 9in(,^) is upper bounded via (9) by either or for 

p>2 and p e (0,2), respectively. More precisely, for the p>2 regime, taking p- we obtain 


n 


- cn ^^P + c\J\og{n)/ 


f Js P\og{nl8)d8 
Jn-ilP ^ 

g(2-p)l2 


2-p 


-Up 


<Cpn ^'PGlog(n). 


The same calculation for p e (0,2) gives iRln(.^) < c^n ^'^^\og{n). 


(36) 

(37) 

(38) 

□ 
Proof of Theorem 11. From Lemma 9,


Vn > sup <( 
/i<l 


RP I 

~Y\I 2 



> sup 
/3<1 




Using p = min1^ 

1 ( R _ ^-P _ 2r _ r 

-V„>Cprmm\ — , K 2(r-i)+p7;2(r-i)+p „ 2(r-i)+p 

n (v/n 

for some constant Cp,r- 


□ 


Proof of Corollary 14. The second derivative of this loss with respect to the first argument is given by $q(q-1)|\hat{y} - y|^{q-2}$
and it is lower bounded by $q(q-1)2^{q-2}$ because $q \in (1, 2)$ and $\hat{y}, y \in [-1, 1]$. This means that the loss is $q(q-1)/2$
strongly convex and so


A(x)> 




We now turn to upper bounding $\Delta$. Choose $\mathcal{S} = \{0\}$ and take $y_1 = 1$, $y_2 = -1$. By symmetry of the loss function, the
optimal $y^* = 0$, verifying property (12). Then


Ao(x) = supmax{Ai^.,Ao_^}, 
xesf 


(39) 


with domain??^ - {0} = [-1,1]. For any y e '3'', the generalized binomial theorem gives an expansion of ({•,y) at the 
point fl y as 


^(M y) - y) + dJia, y) • (h - a)] 


E 

1=2 




{a-y)'^ ■' • [b-a)-i 


Then, taking b - x and a-0, 


1=2 r- 

Since q> Iwe can bound the above by 


- “nEok-fc| i 

Ao(x) < ^—^ix|j= E-^ 

h ]'■ [U j'- 


.j-2\ 


|x| 


q{q-l)x 


E 

U=2 


(7-2)! 


x|^ 


i-2 


q{q-l)j^< 


SiCi-Dj 


^ 7 (^?- l)x^ < 2q[q- l)x^. 


The result follows from Theorems 10 and 11.


□ 